## Gathering the Data
The first step is to gather a large amount of data and store it in a pandas DataFrame.

In [1]:
import pandas as pd
import praw
import secrets  # local secrets.py with the Reddit credentials (shadows the stdlib module of the same name)
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC


In [2]:
user_agent = "Subreddit-Predictor 0.1 by /u/IsThisATrollBot"

reddit = praw.Reddit(
    client_id=secrets.client_ID,
    client_secret=secrets.client_secret,
    password=secrets.password,
    user_agent=user_agent,
    username=secrets.username,
)

Because Pushshift is down, we are limited in how much data we can gather at a time, so we will pull posts from a short list of popular subreddits (the 11 listed below).

In [3]:
# Start with a list of subreddits
top_subreddits = ['announcements', 'funny', 'AskReddit', 'dataisbeautiful', 'Awww', 'datascience', 'pics', 'science', 'worldnews', 'videos', 'AmItheAsshole']

In [4]:
# Create an empty list to store the posts
posts = []

# Iterate through the subreddits and get up to 1000 of the newest posts from each
for sub in top_subreddits:
    subreddit_posts = reddit.subreddit(sub).new(limit=1000)
    for post in subreddit_posts:
        posts.append(post)

In [5]:
# Create a list of dictionaries containing the data for each post
data = [{'id': post.id, 'title': post.title, 'subreddit': post.subreddit.display_name} for post in posts]

# Create a Pandas dataframe from the list of dictionaries
df = pd.DataFrame(data)


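The record-building step can be tried without Reddit credentials. The sketch below is only an illustration: `fake_posts` and `posts_to_records` are hypothetical names, and the stand-in objects expose just the three fields the cell above reads from each post.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for PRAW submissions: only the fields used above.
fake_posts = [
    SimpleNamespace(id="a1", title="First post",
                    subreddit=SimpleNamespace(display_name="funny")),
    SimpleNamespace(id="b2", title="Second post",
                    subreddit=SimpleNamespace(display_name="science")),
]


def posts_to_records(posts):
    """Turn post-like objects into one dict per post (id, title, subreddit)."""
    return [
        {"id": p.id, "title": p.title, "subreddit": p.subreddit.display_name}
        for p in posts
    ]


records = posts_to_records(fake_posts)
print(records[0])
```

A list of dicts in this shape is what `pd.DataFrame(data)` receives in the cell above.
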
In [6]:
test_titles = ['Redditors of Reddit. What is your favorite piece of Reddit history?', 'WIBTA if I stole my younger brothers lunch money?', 'check out this cool video I found', 'asdf', 'cats are dangerous', 'new study shows cats are dangerous', 'reddit cool aita']
test_titles = pd.DataFrame({'title':test_titles})


# Main Subreddit Predictor Class

This will have as attributes the Feature Vectorizers and the Classifiers, which themselves are objects of other classes.

# Class: Subreddit_Predictor

Objects of this class contain attributes and methods that fall into three categories: **Data**, **Collections**, and **Processing**.

## Data
Pandas DataFrames and methods to update and clean the data.

**Attributes:**

| Name      |   Type    |             Description             |
|:----------|:---------:|:-----------------------------------|
| raw_data  | DataFrame |      The raw unprocessed data       |
| full_data | DataFrame |         The processed data          |
| X_train   | DataFrame | the X portion of the training data  |
| Y_train   | DataFrame | the Y portion of the training data  |
| X_test    | DataFrame |   the X portion of the test data    |
| Y_test    | DataFrame |   the Y portion of the test data    |

**Methods:**

| Name (with input/output typing) | Description                                                                                                                               |
|---------------------------------| ------------------------------------------------------------------------------------------------------------------------------------------|
| add_data(df: DataFrame)         | Updates the raw_data attribute                                                                                                            |
| ready_data()                    | Cleans the data and does a test train split. <br/> Overwrites the full_data attribute. <br/> Creates the X_train, Y_train, X_test, and Y_test attributes. |

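As a rough illustration of the cleaning step, the sketch below applies the same three operations (strip non-alphanumeric characters, lowercase, drop empty titles) to plain strings. `clean_title` is a hypothetical helper for this example, not a method of the class.

```python
import re


def clean_title(title):
    """Keep only ASCII letters, digits, and spaces; lowercase; trim whitespace."""
    stripped = re.sub(r"[^a-zA-Z0-9 ]", "", title)
    return stripped.lower().strip()


raw_titles = [
    "WIBTA if I stole my brother's lunch money?",
    "  ???  ",
    "Cats are DANGEROUS!",
]

cleaned = [clean_title(t) for t in raw_titles]

# Titles that are empty after cleaning carry no signal, so they are dropped.
non_empty = [t for t in cleaned if t != ""]
print(non_empty)
```
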
## Collections
Contains dictionaries of vectorizers, classifiers, and trained models.

**Attributes:**

| Name        | Type                  | Description                                                                                                                                |
|:------------|:----------------------|:-------------------------------------------------------------------------------------------------------------------------------------------|
| Vectorizers | dict | Dictionary of Vectorizer objects                                                                                                           |
| Classifiers | dict | Dictionary of Classifier objects                                                                                                           |
| Models      | dict | Dictionary of trained Classifier objects                                                                                                   |
| Models_info | dict        | Dictionary containing a description of each model in Models.                                                                               |
| Predictions | DataFrame   | A DataFrame with all the titles and actual subreddits in X_test and Y_test <br/> There is a column for each model that has the predictions |
| Results | dict | Dictionary of DataFrames for each model. Each row and column is a subreddit. Shows the number of false classifications |

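One way to picture a `Results` entry is as a table of (actual, predicted) counts. The sketch below builds such counts in plain Python. It is an assumption about the intended layout rather than the class's implementation, and `confusion_counts` is a hypothetical name.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs; pairs that differ are misclassifications."""
    return Counter(zip(actual, predicted))


actual = ["funny", "funny", "science", "pics", "science"]
predicted = ["funny", "pics", "science", "pics", "funny"]

counts = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# Rows are actual subreddits, columns are predicted subreddits.
for row in labels:
    print(row, [counts[(row, col)] for col in labels])

misclassified = sum(n for (a, p), n in counts.items() if a != p)
print("misclassified:", misclassified)
```
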
**Methods:**

| Name (with input/output typing)                                                                                  | Description                                                                                                                                                                                                                                                        |
|------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| add_vectorizer(model: Vectorizer)                                                                                | Adds (key = model.name, value = model) to Vectorizers                                                                                                                                                                                                              |
| add_classifier(model: Classifier)                                                                                | Adds (key = model.name, value = model) to Classifiers                                                                                                                                                                                                              |
| train_model(<br/>modelName: str, <br/>vectorizerName: str, <br/>classifierName: str, <br/>description: str = '') | Takes the vectorizer and classifier from Vectorizers and Classifiers. <br/>Trains the classifier.<br/>Names and adds the trained model to Models.<br/>Adds the description text to Models_info.|
| test_model(modelName: str) |  Runs the model against X_test and Y_test. <br/> Updates Predictions and Results |

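The name-keyed lookup that `train_model` performs can be sketched with plain dictionaries. Everything here is a hypothetical stand-in: `ModelRegistry`, the toy `word_count` vectorizer, and the `nearest_neighbour` classifier only illustrate the flow, and unlike the real class this version takes the training data as arguments.

```python
class ModelRegistry:
    """Hypothetical miniature of the Collections part of Subreddit_Predictor."""

    def __init__(self):
        self.Vectorizers = {}
        self.Classifiers = {}
        self.Models = {}
        self.Models_info = {}

    def add_vectorizer(self, name, vectorize):
        self.Vectorizers[name] = vectorize

    def add_classifier(self, name, fit):
        self.Classifiers[name] = fit

    def train_model(self, modelName, vectorizerName, classifierName, X, y, description=""):
        # Look both parts up by name, then fit on the vectorized titles.
        vectorize = self.Vectorizers[vectorizerName]
        fit = self.Classifiers[classifierName]
        self.Models[modelName] = fit([vectorize(t) for t in X], y)
        self.Models_info[modelName] = {
            "vectorizerName": vectorizerName,
            "classifierName": classifierName,
            "description": description,
        }


def word_count(title):
    """Toy vectorizer: a one-element vector holding the number of words."""
    return [len(title.split())]


def nearest_neighbour(vectors, labels):
    """Toy classifier: predict the label of the closest training vector."""
    def predict(vector):
        distances = [abs(v[0] - vector[0]) for v in vectors]
        return labels[distances.index(min(distances))]
    return predict


registry = ModelRegistry()
registry.add_vectorizer("wc", word_count)
registry.add_classifier("1nn", nearest_neighbour)
registry.train_model(
    "wc+1nn", "wc", "1nn",
    X=["short title", "a much longer title with many words"],
    y=["pics", "AskReddit"],
    description="toy example",
)

model = registry.Models["wc+1nn"]
print(model(word_count("tiny")))
```
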
## Processing

**Methods:**

| Name (with input/output typing)                       | Description                                                                                   |
|-------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| predict(modelName:str, title: str, titles: iter[str]) | Given a model, enter a title or a list/dataframe of titles. Will return the model prediction. |
| compare(models: list[str])                            | Creates a bar chart comparing each of the models on each of the subreddits.                  |

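The numbers behind a `compare` chart could be per-subreddit accuracies for each model. The sketch below computes them in plain Python under that assumption. `per_subreddit_accuracy` and the model names `model_a` and `model_b` are hypothetical, and the plotting itself is left out.

```python
from collections import defaultdict


def per_subreddit_accuracy(actual, predicted):
    """Fraction of correctly predicted titles for each actual subreddit."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a == p:
            correct[a] += 1
    return {label: correct[label] / totals[label] for label in totals}


actual = ["funny", "funny", "science", "science"]
predictions_by_model = {
    "model_a": ["funny", "science", "science", "science"],
    "model_b": ["funny", "funny", "funny", "funny"],
}

scores = {
    name: per_subreddit_accuracy(actual, preds)
    for name, preds in predictions_by_model.items()
}
print(scores)
```

Each inner dictionary holds one bar per subreddit, so the models can be drawn side by side.
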
In [13]:
type(obj)

NameError: name 'obj' is not defined

In [None]:
class Subreddit_Predictor:
    """
    Objects of this class contain the following:

    Data - Pandas DataFrames and methods to update and clean the data.
    Attributes:
        obj.raw_data, full_data, X_train, Y_train, X_test, Y_test
    Methods:
        obj.add_data(df), ready_data()

    Containers - Dictionaries which contain other objects. The key to each dictionary is always the name.
    Attributes:
        obj.Vectorizers, Classifiers, Models, Models_info
    Methods:
        obj.add_vectorizer(model: Vectorizer), obj.add_classifier(model: Classifier), obj.train_model(vectorizerName: str, classifierName: str, modelName: str)

    Analyzer - The visual representation of the results of the different models.




    """
    def __init__(self):
        self.raw_data = pd.DataFrame({'id':[], 'title':[], 'subreddit':[]})
        self.subreddits = []
        self.data = pd.DataFrame({'id':[], 'title':[], 'subreddit':[]})
        self.Feature_Vectors = {}
        self.Embedding = {}
        self.Title_Vectorizers = {}
        self.Classifiers = {}
        self.Models = {}
        self.Models_info = {}

    def add_data(self, df):
        """df is a pandas DataFrame with columns={'id':[], 'title':[], 'subreddit':[]}. It will be merged with the existing raw_data, dropping rows whose id is already present"""
        self.raw_data = pd.concat([self.raw_data, df]).drop_duplicates(subset='id')

    def clean_data(self):
        """Cleans the data in raw_data and updates self.data"""

        # Work on a copy so raw_data keeps the original titles
        df = self.raw_data.copy()

        # Remove all non-alphanumeric characters
        df['title'] = df['title'].str.replace(r'[^a-zA-Z0-9 ]', '', regex = True)

        # Make all the text lowercase
        df['title'] = df['title'].str.lower()

        # Remove rows whose title is empty after cleaning
        df['title'] = df['title'].str.strip()
        is_empty = df['title'] == ''
        df = df[~is_empty]

        # Store the cleaned data
        self.data = df

        # Update the subreddits attribute
        self.subreddits = self.data['subreddit'].unique().tolist()

    def ready_data(self, test_size = .2, seed = 42):
        """Splits and encodes the data. Saves it in X_train, Y_train, X_test, Y_test."""

        # Change the index
        self.data = self.data.set_index('id')

        # Encode the subreddits
        self._le = LabelEncoder()
        self.data['subreddit_num'] = self._le.fit_transform(self.data['subreddit'])

        # Split the data, reusing the labels encoded above
        self.X_train, self.X_test, self.Y_train, self.Y_test = train_test_split(self.data['title'], self.data['subreddit_num'], test_size=test_size, random_state = seed)

    def add_title_vectorizer(self, title_vectorizer):
        """This is how we add a title_vectorizer to our collection"""
        title_vectorizer.train(self.X_train)
        self.Title_Vectorizers[title_vectorizer.featureName] = title_vectorizer
        self.Feature_Vectors[title_vectorizer.featureName] = title_vectorizer.vectorize(self.X_train)

    def add_classifier(self, classifier):
        """We add the classifier to our collection, self.Classifiers"""
        self.Classifiers[classifier.classifierName] = classifier

    def train_model(self, modelName, featureName, classifierName, description = ''):
        """
        :param modelName: The name of this model
        :param featureName: Which feature vectors are we using?
        :param classifierName: Which classifier are we using?
        :param description: Write a short description of the model (optional).
        :return: Adds a trained object of the classifier class to self.Models
        """

        self.Models_info[modelName] = {'featureName':featureName, 'classifierName':classifierName, 'description':description}

        X_train = self.Feature_Vectors[featureName]
        Y_train = self.Y_train
        classifier = self.Classifiers[classifierName]
        classifier.train(X_train, Y_train)

        self.Models[modelName] = classifier


    def predictions(self, modelName, titles):
        """
        :param modelName: Which model are we using?
        :param titles: A series of titles, or a data frame with a 'title' column
        :return: A data frame with a 'prediction' column of subreddit names, indexed like the titles
        """

        model = self.Models[modelName]

        featureName = self.Models_info[modelName]['featureName']
        vectorizer = self.Title_Vectorizers[featureName]

        title_vectors = vectorizer.vectorize(titles)

        # Decode the numeric labels back into subreddit names
        predicted = self._le.inverse_transform(model.predict(title_vectors))

        return pd.DataFrame({'prediction': predicted}, index=title_vectors.index)




    def generate_features(self, featureName):
        """Generates the features using the different methods we have created"""

        if featureName == 'BoW':
            self.Embedding['BoW'] = CountVectorizer()
            self.Feature_Vectors['BoW'] = self.Embedding['BoW'].fit_transform(self.X_train)

        if featureName == 'D2V':

            # Create a list of TaggedDocument objects from the training titles
            X_train_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(self.X_train.tolist())]

            model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0)
            model_dbow.build_vocab(X_train_tagged)

            # Train the model
            model_dbow.train(X_train_tagged, total_examples=model_dbow.corpus_count, epochs=100)

            # Infer a vector for each training title
            vectors = [model_dbow.infer_vector(title.split()) for title in self.X_train.tolist()]

            # Keep the trained model and the vectors, one row per title
            self.Embedding['D2V'] = model_dbow
            self.Feature_Vectors['D2V'] = pd.DataFrame(vectors, index=self.X_train.index)

    def vectorize(self, featureName, x):
        """Turns a sentence or list of sentences into feature vectors"""

        if isinstance(x, str): return self.vectorize(featureName, [x])

        else:
            if featureName == 'BoW':
                return self.Embedding['BoW'].transform(x).toarray()



Example

In [None]:
obj = Subreddit_Predictor()
obj.add_data(df)
obj.clean_data()
obj.ready_data(test_size=.3, seed=29)

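`ready_data` turns subreddit names into integers, so predictions come back as integers until they are decoded. The plain-Python sketch below shows the round trip. `fit_labels` is a hypothetical helper that assigns integers in sorted order, which is also how scikit-learn's `LabelEncoder` orders its `classes_`.

```python
def fit_labels(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: i for i, label in enumerate(classes)}


subreddits = ["pics", "funny", "science", "funny"]

classes, to_int = fit_labels(subreddits)
encoded = [to_int[s] for s in subreddits]

# Decoding is a lookup in the sorted class list.
decoded = [classes[i] for i in encoded]

print(encoded)
print(decoded)
```
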
In [None]:
obj.Models_info

In [None]:
obj.predictions('BoW+SVM', test_titles)

# Title Vectorizer Class

This class holds all of the different vectorizers, that is, all of the different ways to embed titles.
A key feature of this class is that each object's `_train` and `_vectorize` functions are added later, after the object is created.

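Attaching behaviour after construction relies on a Python detail: a function assigned to an instance attribute is not bound to that instance, so it does not receive `self`. The sketch below shows this with a hypothetical miniature class, `Pluggable`, and a toy training function.

```python
class Pluggable:
    """Hypothetical miniature of a class whose behaviour is attached per object."""

    def __init__(self, name):
        self.name = name
        self.model = None

    def train(self, X_train):
        self.model = self._train(X_train)

    def _train(self, X_train):
        # Default used until a real function is assigned to the object.
        raise NotImplementedError("assign a _train function to this object first")


def vocabulary_train(X_train):
    # No `self` parameter: a function stored on an instance is not bound to it.
    return sorted({word for title in X_train for word in title.split()})


vec = Pluggable("vocab")
vec._train = vocabulary_train  # shadows the class-level placeholder
vec.train(["cats are cool", "dogs are cool"])
print(vec.model)
```

Because the assignment lives on the object, two objects of the same class can carry entirely different training functions.
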
In [None]:
class Title_Vectorizer:
    """This class is to hold all of the Title Vectorizers, like Bag-of-Words and Doc2Vec. Each vectorizer is a specific object. The class methods all have the same input/output."""
    def __init__(self, featureName):
        self.featureName = featureName
        self.description = "Description goes here"

    def train(self, X_train):
        """Inputs the training data. Creates the self.model"""

        self.model = self._train(X_train)

    def _train(self, X_train):
        """Just a placeholder; the actual function is assigned to each object."""
        raise NotImplementedError("assign a _train function to this object")

    def vectorize(self, df_titles):
        """Given a data frame or series with only titles, will return a df of all of the features, indexed by id. The actual function will be added to each object."""

        return self._vectorize(df_titles, self.model)

    def _vectorize(self, df_titles, model):
        """Just a placeholder; the actual function is assigned to each object."""
        raise NotImplementedError("assign a _vectorize function to this object")



### Example: Bag-of-Words

In [None]:
BoW_model = Title_Vectorizer('BoW')

def _BoW_vectorize(df_titles, model):
    """Returns the bag-of-words counts for each title, indexed like the input.
    CountVectorizer.transform already ignores words outside the fitted vocabulary, so no manual filtering is needed."""

    if isinstance(df_titles, pd.DataFrame):
        titles = df_titles['title']
    else:
        titles = df_titles

    # One row per title, one column per vocabulary word
    temp = pd.DataFrame(model.transform(titles).toarray(), index=df_titles.index)
    temp.index.name = 'id'
    return temp

def _BoW_train(X_train):
    # Only the fitted vocabulary is needed here, so fit() is enough
    model = CountVectorizer()
    model.fit(X_train)
    return model

BoW_model._vectorize = _BoW_vectorize
BoW_model._train = _BoW_train

obj.add_title_vectorizer(BoW_model)

In [None]:
BoW_model.vectorize(test_titles)

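To make the bag-of-words idea concrete, here is a small plain-Python sketch: learn a vocabulary from training titles, then count only the known words in a new title. It splits on whitespace, which is simpler than `CountVectorizer`'s tokenizer, and `fit_vocabulary` and `bag_of_words` are hypothetical helper names.

```python
from collections import Counter


def fit_vocabulary(titles):
    """Assign a column index to every distinct word in the training titles."""
    words = sorted({word for title in titles for word in title.split()})
    return {word: i for i, word in enumerate(words)}


def bag_of_words(title, vocabulary):
    """Count vocabulary words in the title; unknown words are ignored."""
    counts = Counter(word for word in title.split() if word in vocabulary)
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        vector[vocabulary[word]] = n
    return vector


vocab = fit_vocabulary(["cats are cool", "cats chase cats"])
vector = bag_of_words("new study shows cats are cats", vocab)
print(vocab)
print(vector)
```

Repeated words keep their counts, and words never seen during fitting contribute nothing to the vector.
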
### Example: Doc2Vec

In [None]:
D2V_model = Title_Vectorizer('D2V')
#D2V_model.params = {'dm':0, 'vector_size':300, 'negative':5, 'hs':0, 'min_count':2, 'sample':0, 'epochs':100}

def _D2V_train(X_train):

    X_train_tagged = X_train.tolist()
    X_train_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(X_train_tagged)]

    model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0)
    model_dbow.build_vocab(X_train_tagged)

    # Train the model
    model_dbow.train(X_train_tagged, total_examples=model_dbow.corpus_count, epochs=100)

    return model_dbow

def _D2V_vectorize(df_titles, model):
    """Infers a Doc2Vec vector for each title. Returns one row per title, indexed like the input."""

    if isinstance(df_titles, pd.DataFrame):
        titles = df_titles['title']
    else:
        titles = df_titles

    vectors = [model.infer_vector(title.split()) for title in titles.tolist()]

    return pd.DataFrame(vectors, index=titles.index)

D2V_model._train = _D2V_train
D2V_model._vectorize = _D2V_vectorize

obj.add_title_vectorizer(D2V_model)

# Classifiers

This is the class that holds the classifiers, like XGBoost and Support Vector Machines.

In [None]:
class classifier:
    """This is the class that holds the classifiers"""

    def __init__(self, classifierName):
        self.classifierName = classifierName

    def train(self, X_train, Y_train):
        """Input the X and Y training data. Then update the model"""

        self.model = self._train(X_train, Y_train)

    def _train(self, X_train, Y_train):
        """Where the real function is stored"""
        pass

    def predict(self, title_vectors):
        """
        :param title_vectors: A pandas dataframe of the vectorized titles
        :return: A pandas series with the predictions
        """

        return self._predict(title_vectors, self.model)

    def _predict(self, titles, model):
        """where the actual function is stored"""
        pass


### Example: Support Vector Machine

In [None]:
SVM_model = classifier('SVM')

def _SVM_train(X_train, Y_train):
    model = SVC()
    model.fit(X_train, Y_train)
    return model

def _SVM_predict(title_vectors, model):
    """Takes a data frame of title vectors. Returns an array of predicted (encoded) subreddit labels."""

    return model.predict(title_vectors)

SVM_model._train = _SVM_train
SVM_model._predict = _SVM_predict

In [None]:
obj.add_classifier(SVM_model)

In [None]:
obj.train_model('BoW+SVM', 'BoW', 'SVM', description= 'Just a quick test')

In [None]:
obj.predictions('BoW+SVM', obj.X_test)

In [None]:
test_titles

In [None]:

# Test a Bag-of-Words + SVM model on some new data
featureName, classifierName = 'BoW', 'SVM'

# Fit the embedding and the classifier before asking for predictions
Embedding = {featureName: CountVectorizer()}
Features = {featureName: Embedding[featureName].fit_transform(obj.X_train)}
Models = {(featureName, classifierName): SVC()}
Models[(featureName, classifierName)].fit(Features[featureName], obj.Y_train)

new_titles = ['Redditors of Reddit. What is your favorite piece of Reddit history?', 'WIBTA if I stole my younger brothers lunch money?', 'check out this cool video I found', 'asdf', 'cats are dangerous', 'new study shows cats are dangerous']
new_vectors = Embedding[featureName].transform(new_titles)

new_predictions = Models[(featureName, classifierName)].predict(new_vectors)

# Decode the numeric labels back into subreddit names
output = pd.DataFrame({'title': new_titles, 'Prediction':new_predictions})
output['Prediction'] = obj._le.inverse_transform(output['Prediction'])
output

In [None]:
D2V_model.vectorize(obj.X_train)

In [None]:
D2V_model.vectorize(obj.X_test)


In [None]:

D2V_model.vectorize(test_titles['title'])

In [None]:
# Create a list of TaggedDocument objects from the training titles
X_train_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(obj.X_train.tolist())]

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0)
model_dbow.build_vocab(X_train_tagged)

# Train the model
model_dbow.train(X_train_tagged, total_examples=model_dbow.corpus_count, epochs=100)

# Get the vectorized titles from the doc2vec model
vectors = [model_dbow.infer_vector(title.split()) for title in obj.X_train.tolist()]

# Add the vectors to the dataframe as a new column
df_new = pd.DataFrame({'title':obj.X_train, 'vector': vectors})
df_new

In [None]:
BoW_model = Title_Vectorizer('BoW')
BoW_model._vectorize = _BoW_vectorize
BoW_model._train = _BoW_train
#BoW_model.train(obj.X_train)
#BoW_model.vectorize(obj.X_train)


In [None]:
#_BoW_train(obj.X_train)
BoW_model._train = _BoW_train
BoW_model.train(obj.X_train)
type(BoW_model.model)

In [None]:
type(BoW_model.model)

In [None]:
BoW_model.model.transform(list(test_titles['title'])).toarray()

In [None]:
import pandas as pd

# Create a sample pandas series
s = pd.Series(['I love dogs', 'I hate cats', 'I like turtles'])

# Create a vocabulary
vocab = ['I', 'love', 'hate', 'like']

# Remove words from the sentences that are not in the vocabulary
filtered_s = s.apply(lambda x: ' '.join([word for word in x.split() if word in vocab]))

# Print the filtered series
print(filtered_s)
# Remove words from the sentences that are not in the vocabulary
filtered_s = s.apply(lambda x: ' '.join(set(x.split()).intersection(vocab)))

# Print the filtered series
print(filtered_s)


In [None]:
x = CountVectorizer()
x.fit_transform(obj.X_train)
vocab = x.vocabulary_
'im' in vocab

In [None]:
test_titles

In [None]:
BoW_model.vectorize(test_titles)

In [None]:
pd.DataFrame(x, obj.X_train.index).info()

In [None]:
pd.DataFrame({'title':obj.X_train, 'vector': x})

In [None]:
obj = Subreddit_Predictor()
obj.add_data(df)
obj.clean_data()
obj.ready_data(test_size=.3, seed=29)
obj.add_title_vectorizer(BoW_model)


In [None]:
obj.Feature_Vectors['BoW']

In [None]:
def foo(x):
    print ('hello',x)

obj.fun = foo

obj.fun(2)

In [None]:




# Convert the labels to numerical values
le = LabelEncoder()
df['subreddit_num'] = le.fit_transform(df['subreddit'])

df = df.drop(columns=['subreddit'])

#df['subreddit'] = le.inverse_transform(df['subreddit_num'])

df


In [None]:
df_new = pd.DataFrame({'id':['pg006s'], 'title':[a], 'subreddit':['announcements']}).set_index('id')

In [None]:
pd.concat([df_new, df]).drop_duplicates(keep = False)

In [None]:
df.drop_duplicates(keep = 'first')

In [None]:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 6], 'C': [7, 8, 8, 9, 9]})

# Find duplicate rows
duplicate_rows = df[df.duplicated()]

# Print the duplicate rows
print(duplicate_rows)


In [None]:
df[df.duplicated()]

In [None]:
class Subreddit_Predictor:
    """
    Objects of this class contain the following:

    Data - Pandas DataFrames and methods to update and clean the data.
    Attributes:
        obj.raw_data, full_data, X_train, Y_train, X_test, Y_test
    Methods:
        obj.add_data(df), ready_data()

    Containers - Dictionaries which contain other objects. The key to each dictionary is always the name.
    Attributes:
        obj.Vectorizers, Classifiers, Models, Models_info
    Methods:
        obj.add_vectorizer(model: Vectorizer), obj.add_classifier(model: Classifier), obj.train_model(vectorizerName: str, classifierName: str, modelName: str)

    Analyzer - The visual representation of the results of the different models.




    """
    def __init__(self):
        self.raw_data = pd.DataFrame({'id':[], 'title':[], 'subreddit':[]})
        self.subreddits = []
        self.data = pd.DataFrame({'id':[], 'title':[], 'subreddit':[]})
        self.Feature_Vectors = {}
        self.Embedding = {}
        self.Title_Vectorizers = {}
        self.Classifiers = {}
        self.Models = {}
        self.Models_info = {}

    def add_data(self, df):
        """df is a pandas DataFrame with columns={'title':[], 'subreddit':[]}. It will be merged with the existing raw_data"""
        self.raw_data = pd.concat([self.raw_data, df]).drop_duplicates(subset='id')

    def clean_data(self):
        """Cleans the data in raw_data and updates self.data"""

        df = self.raw_data

        # Remove all non-alpha-numeric characters
        df['title'] = df['title'].str.replace(r'[^a-zA-Z0-9 ]', '', regex = True)

        # Make all the text lowercase
        df['title'] = df['title'].str.lower()

        # Remove empty rows
        df['title'] = df['title'].str.strip()
        filter = df['title'] == ''
        df = df.drop(df[filter].index)

        # Store it as
        self.data = df

        #update the subreddits attribute
        self.subreddits = self.data['subreddit'].unique().tolist()

    def ready_data(self, test_size = .2, seed = 42):
        """Splits and encodes the data. Saves is in X_train, Y_train, X_test, Y_test."""

        # Change the index
        self.data = self.data.set_index('id')

        # Encode the subreddits
        self._le = LabelEncoder()
        self.data['subreddit_num'] = self._le.fit_transform(self.data['subreddit'])

        # Split the data
        self.X_train, self.X_test, self.Y_train, self.Y_test = train_test_split(self.data['title'], self._le.fit_transform(self.data['subreddit']), test_size=test_size, random_state = seed)

    def add_title_vectorizer(self, title_vectorizer):
        """This is how we add a title_vectorizer to our collection"""
        title_vectorizer.train(self.X_train)
        self.Title_Vectorizers[title_vectorizer.featureName] = title_vectorizer
        self.Feature_Vectors[title_vectorizer.featureName] = title_vectorizer.vectorize(self.X_train)

    def add_classifier(self, classifier):
        """We add the classifier to our collection, self.Classifiers"""
        self.Classifiers[classifier.classifierName] = classifier

    def train_model(self, modelName, featureName, classifierName, description = ''):
        """
        :param modelName: The name of this model
        :param featureName: Which feature vectors are we using?
        :param classifierName: Which classifier are we using?
        :param description: Write a short discription of the model (optional).
        :return: Adds a trained object of the classifier class to self.Models
        """

        self.Models_info[modelName] = {'featureName':featureName, 'classifierName':classifierName, 'description':description}

        X_train = self.Feature_Vectors[featureName]
        Y_train = self.Y_train
        classifier = self.Classifiers[classifierName]
        classifier.train(X_train, Y_train)

        self.Models[modelName] = classifier


    def predictions(self, modelName, titles):
        """
        :param modelName: Which model are we using?
        :param titles: A list or series of titles
        :return: A data frame of 'title' and 'prediction'
        """

        model = self.Models[modelName]

        featureName = self.Models_info[modelName]['featureName']
        vectorizer = self.Title_Vectorizers[featureName]

        title_vectors = vectorizer.vectorize(titles)

        df = model.predict(title_vectors)
        #df['prediction'] = self._le.inverse_transform(df['prediction'])

        return df




    def generate_features(self, featureName):
        """Generates the features using the different methods we have created"""

        if featureName == 'BoW':
            self.Embedding['BoW'] = CountVectorizer()
            self.Features['BoW'] = self.Embedding['BoW'].fit_transform(self.X_train)

        if featureName == 'D2V':

            # Create a list of TaggedDocument objects from the titles
            X_train_tagged = self.X_train.tolist()
            X_train_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(X_train_tagged)]
            X_test_tagged = self.X_test.tolist()
            X_test_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(X_test_tagged)]

            model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
            model_dbow.build_vocab(X_train_tagged)

            # Train the model
            model_dbow.train(X_train_tagged, total_examples=model_dbow.corpus_count, epochs=100)

            # Get the vectorized titles from the doc2vec model
            vectors = [model_dbow.infer_vector(title.split()) for title in X_train.tolist()]

            # Add the vectors to the dataframe as a new column
            df_new = pd.DataFrame({'title':X_train, 'vector': vectors})
            df_new

    def vectorize(self, featureName, x):
        """Turns a sentence or list of sentences into a feature vectors"""

        if type(x) == str: return self.vectorize(featureName, [x])

        else:
            if featureName == 'BoW':
                return self.Embedding['BoW'].transform(x).toarray()



Example

In [None]:
obj = Subreddit_Predictor()
obj.add_data(df)
obj.clean_data()
obj.ready_data(test_size=.3, seed=29)

In [None]:
obj.Models_info

In [None]:
obj.predictions('BoW+SVM', test_titles)

# Title Vectorizer Class

This will have all of the different vectorizers. All of the different ways to embed titles.
A key feature of this class is that there are functions which need to be added later.

In [None]:
class Title_Vectorizer:
    """This class is to hold all of the Title Vectorizers, like Bag-of-Words and Doc2Vec. Each vectorizer is a specific object. The class methods all have the same input/output."""
    def __init__(self, featureName):
        self.featureName = featureName
        self.description = "Description goes here"

    def train(self, X_train):
        """Takes the training data and creates self.model."""

        self.model = self._train(X_train)

    def _train(self, X_train):
        """Just a placeholder for the actual function."""
        pass

    def vectorize(self, df_titles):
        """Given a data frame or series with only titles, returns a df of all of the features, indexed by id. The actual function will be added to each object."""

        return self._vectorize(df_titles, self.model)

    def _vectorize(self, df_titles, model):
        """Just a placeholder for the actual function."""
        pass



### Example: Bag-of-Words

In [None]:
BoW_model = Title_Vectorizer('BoW')

def _BoW_vectorize(df_titles, model):
    """Vectorizes titles with a fitted CountVectorizer. Words outside the vocabulary need no special handling: transform() ignores them."""

    if isinstance(df_titles, pd.DataFrame):
        titles = df_titles['title']
    else:
        titles = df_titles

    # Titles are passed through unchanged, so repeated words keep their counts
    temp = model.transform(titles).toarray()
    temp = pd.DataFrame(temp)
    temp['id'] = df_titles.index
    temp = temp.set_index('id')
    return temp

def _BoW_train(X_train):
    model = CountVectorizer()
    # Only the fitted vocabulary is needed here, so fit() is enough
    model.fit(X_train)
    return model

BoW_model._vectorize = _BoW_vectorize
BoW_model._train = _BoW_train

obj.add_title_vectorizer(BoW_model)

In [None]:
BoW_model.vectorize(test_titles)

### Example: Doc2Vec

In [None]:
D2V_model = Title_Vectorizer('D2V')
#D2V_model.params = {'dm':0, 'vector_size':300, 'negative':5, 'hs':0, 'min_count':2, 'sample':0, 'epochs':100}

def _D2V_train(X_train):

    X_train_tagged = X_train.tolist()
    X_train_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(X_train_tagged)]

    model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0)
    model_dbow.build_vocab(X_train_tagged)

    # Train the model
    model_dbow.train(X_train_tagged, total_examples=model_dbow.corpus_count, epochs=100)

    return model_dbow

def _D2V_vectorize(df_titles, model):
    """Infers one Doc2Vec vector per title and returns a dataframe with one column per dimension."""

    vectors = [model.infer_vector(title.split()) for title in df_titles.tolist()]
    df_new = pd.DataFrame({'title':df_titles, 'vector': vectors})
    df_new = df_new['vector'].apply(lambda x: pd.Series(x))

    return df_new

D2V_model._train = _D2V_train
D2V_model._vectorize = _D2V_vectorize

obj.add_title_vectorizer(D2V_model)

# Classifiers

This class holds the classifiers, such as XGBoost and Support Vector Machines.
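
As a rough sketch of the train/predict contract, a majority-class baseline can be plugged in the same way as any other classifier. Everything named here (`SketchClassifier`, `majority_train`, `majority_predict`, the toy labels) is an assumption made for illustration and does not appear in the notebook.

```python
from collections import Counter


class SketchClassifier:
    """Hypothetical stand-in mirroring the train/predict wrapper idea."""

    def __init__(self, classifier_name):
        self.classifier_name = classifier_name
        self.model = None

    def train(self, X_train, Y_train):
        # Store whatever the attached training function returns.
        self.model = self._train(X_train, Y_train)

    def predict(self, title_vectors):
        # Hand the stored model to the attached prediction function.
        return self._predict(title_vectors, self.model)


def majority_train(X_train, Y_train):
    # The "model" is simply the most frequent training label.
    return Counter(Y_train).most_common(1)[0][0]


def majority_predict(title_vectors, model):
    # Predict the same label for every input vector.
    return [model for _ in title_vectors]


baseline = SketchClassifier("majority")
baseline._train = majority_train
baseline._predict = majority_predict
baseline.train([[1, 0], [0, 1], [1, 1]], ["pics", "funny", "pics"])
predictions = baseline.predict([[0, 0], [1, 0]])
```

A baseline like this gives a floor to compare a real classifier against: a model that does not beat the majority label has learned little from the title vectors.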

In [None]:
class classifier:
    """This is the class that holds the classifiers"""

    def __init__(self, classifierName):
        self.classifierName = classifierName

    def train(self, X_train, Y_train):
        """Input the X and Y training data. Then update the model"""

        self.model = self._train(X_train, Y_train)

    def _train(self, X_train, Y_train):
        """Where the real function is stored"""
        pass

    def predict(self, title_vectors):
        """
        :param title_vectors: A pandas dataframe of the vectorized titles
        :return: A pandas series with the predictions
        """

        return self._predict(title_vectors, self.model)

    def _predict(self, titles, model):
        """where the actual function is stored"""
        pass


### Example: Support Vector Machine

In [None]:
SVM_model = classifier('SVM')

def _SVM_train(X_train, Y_train):
    model = SVC()
    model.fit(X_train, Y_train)
    return model

def _SVM_predict(title_vectors, model):
    """Takes a dataframe of vectorized titles. Returns the predicted labels as produced by model.predict."""

    predictions = model.predict(title_vectors)
    print(predictions)
    return predictions

SVM_model._train = _SVM_train
SVM_model._predict = _SVM_predict

In [None]:
obj.add_classifier(SVM_model)

In [None]:
obj.train_model('BoW+SVM', 'BoW', 'SVM', description= 'Just a quick test')

In [None]:
obj.predictions('BoW+SVM', obj.X_test)

In [None]:
test_titles

In [None]:

# Test the model on some new data.
# Scratch cell: it relies on Embedding, Features, Models, le, Y_train, featureName
# and classifierName already being defined in the session.
new_titles = ['Redditors of Reddit. What is your favorite piece of Reddit history?', 'WIBTA if I stole my younger brothers lunch money?', 'check out this cool video I found', 'asdf', 'cats are dangerous', 'new study shows cats are dangerous']

# Fit the classifier before asking it for predictions
Models[(featureName, classifierName)] = SVC()
Models[(featureName, classifierName)].fit(Features[featureName], Y_train)

new_vectors = Embedding[featureName].transform(new_titles)
new_predictions = Models[(featureName, classifierName)].predict(new_vectors)

output = pd.DataFrame({'title': new_titles, 'Prediction':new_predictions})
output['Prediction'] = le.inverse_transform(output['Prediction'])
output

In [None]:
D2V_model.vectorize(obj.X_train)

In [None]:
D2V_model.vectorize(obj.X_test)


In [None]:

D2V_model.vectorize(test_titles['title'])

In [None]:
# Create a list of TaggedDocument objects from the titles
# (outside the class there is no self, so the data is read from obj)
X_train_tagged = obj.X_train.tolist()
X_train_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(X_train_tagged)]
X_test_tagged = obj.X_test.tolist()
X_test_tagged = [TaggedDocument(words=title.split(), tags=[str(i)]) for i, title in enumerate(X_test_tagged)]

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0)
model_dbow.build_vocab(X_train_tagged)

# Train the model
model_dbow.train(X_train_tagged, total_examples=model_dbow.corpus_count, epochs=100)

# Get the vectorized titles from the doc2vec model
vectors = [model_dbow.infer_vector(title.split()) for title in obj.X_train.tolist()]

# Add the vectors to the dataframe as a new column
df_new = pd.DataFrame({'title':obj.X_train, 'vector': vectors})
df_new

In [None]:
BoW_model = Title_Vectorizer('BoW')
BoW_model._vectorize = _BoW_vectorize
BoW_model._train = _BoW_train
#BoW_model.train(obj.X_train)
#BoW_model.vectorize(obj.X_train)


In [None]:
#_BoW_train(obj.X_train)
BoW_model._train = _BoW_train
BoW_model.train(obj.X_train)
type(BoW_model.model)

In [None]:
type(BoW_model.model)

In [None]:
BoW_model.model.transform(list(test_titles['title'])).toarray()

In [None]:
import pandas as pd

# Create a sample pandas series
s = pd.Series(['I love dogs', 'I hate cats', 'I like turtles'])

# Create a vocabulary
vocab = ['I', 'love', 'hate', 'like']

# Remove words from the sentences that are not in the vocabulary
filtered_s = s.apply(lambda x: ' '.join([word for word in x.split() if word in vocab]))

# Print the filtered series
print(filtered_s)
# Alternative: set intersection (drops repeated words and does not preserve word order)
filtered_s = s.apply(lambda x: ' '.join(set(x.split()).intersection(vocab)))

# Print the filtered series
print(filtered_s)
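
The two filters in this cell are not equivalent. As a small sketch with made-up data (the vocabulary and title below are illustrative only), the list comprehension keeps word order and repeated words, while the set intersection keeps each word at most once, which would change bag-of-words counts:

```python
from collections import Counter

# Illustrative data, not taken from the Reddit titles
vocab = {"i", "love", "dogs"}
title = "dogs dogs i love cats"

# List comprehension: keeps order and repeated words
kept_in_order = [word for word in title.split() if word in vocab]

# Set intersection: each surviving word appears once, in no fixed order
kept_as_set = set(title.split()) & vocab

counts_list = Counter(kept_in_order)
counts_set = Counter(kept_as_set)

print(counts_list["dogs"], counts_set["dogs"])
```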


In [None]:
x = CountVectorizer()
x.fit_transform(obj.X_train)
vocab = x.vocabulary_
'im' in vocab

In [None]:
test_titles

In [None]:
# test_titles is already a dataframe with a 'title' column, so it is passed directly
BoW_model.vectorize(test_titles)

In [None]:
pd.DataFrame(x, obj.X_train.index).info()

In [None]:
pd.DataFrame({'title':obj.X_train, 'vector': x})

In [None]:
obj = Subreddit_Predictor()
obj.add_data(df)
obj.clean_data()
obj.ready_data(test_size=.3, seed=29)
obj.add_title_vectorizer(BoW_model)


In [None]:
obj.Feature_Vectors['BoW']

In [None]:
def foo(x):
    print('hello', x)

obj.fun = foo

obj.fun(2)

In [None]:




# Convert the labels to numerical values
le = LabelEncoder()
df['subreddit_num'] = le.fit_transform(df['subreddit'])

df = df.drop(columns=['subreddit'])

#df['subreddit'] = le.inverse_transform(df['subreddit_num'])

df


In [None]:
df_new = pd.DataFrame({'id':['pg006s'], 'title':[a], 'subreddit':['announcements']}).set_index('id')

In [None]:
pd.concat([df_new, df]).drop_duplicates(keep = False)

In [None]:
df.drop_duplicates(keep = 'first')

In [None]:
import pandas as pd

# Create a sample DataFrame (named sample_df so the Reddit data in df is not overwritten)
sample_df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 6], 'C': [7, 8, 8, 9, 9]})

# Find duplicate rows
duplicate_rows = sample_df[sample_df.duplicated()]

# Print the duplicate rows
print(duplicate_rows)


In [None]:
df[df.duplicated()]