# Predicting a certain subreddit

This notebook looks at the subreddits of two political candidates, Bernie Sanders (r/SandersForPresident) and Joe Biden (r/JoeBiden). Using the titles of the posts in each subreddit, I built models that predict which subreddit a given title came from. I pulled the data through the Pushshift API, which serves Reddit submissions as JSON.

In [3]:
#Using the requests library, I pulled the data through the Pushshift API. Below I pull the submissions from the
#Bernie Sanders subreddit (r/SandersForPresident).

import requests

bern_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=SandersForPresident&size=1000'

res_1 = requests.get(bern_url)

res_1.status_code

200
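The same URL is built again later for the Joe Biden subreddit, so the query string can be generated rather than typed out. The sketch below assumes nothing beyond the endpoint and the two parameters used above; `build_submission_url` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlencode

# Endpoint taken from the request above.
PUSHSHIFT_SUBMISSIONS = "https://api.pushshift.io/reddit/search/submission"


def build_submission_url(subreddit, size=1000):
    """Return the submission search URL for one subreddit (hypothetical helper)."""
    query = urlencode({"subreddit": subreddit, "size": size})
    return f"{PUSHSHIFT_SUBMISSIONS}?{query}"


bern_url = build_submission_url("SandersForPresident")
biden_url = build_submission_url("JoeBiden")
print(bern_url)
print(biden_url)
```

Passing either URL to `requests.get` works as before. Calling `raise_for_status()` on the response is one way to fail loudly on a 4xx or 5xx reply instead of checking `status_code` by eye.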

In [4]:
#The API returns JSON, so below I parse the response body into a Python dictionary.
data = res_1.json()

In [5]:
#Below is the first dictionary in the list of submissions stored under the 'data' key.
data['data'][0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'fenspyre',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_is6bz',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1580320506,
 'domain': 'self.SandersForPresident',
 'full_link': 'https://www.reddit.com/r/SandersForPresident/comments/evqd4x/my_fundraising_pledge/',
 'gildings': {},
 'id': 'evqd4x',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_st

## Exploring the Data: Bernie Sanders Subreddit

To explore the data from the subreddit, I converted the list of dictionaries into a dataframe. I then wanted to see the most common two-word phrases (bigrams) in the titles, which required count vectorizing the titles.
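As a rough picture of what count vectorizing with two-word phrases does, here is a plain-Python sketch. It is not scikit-learn's implementation: the stop-word set is a tiny stand-in for the much longer `'english'` list, and the two titles are made up for illustration. It does follow the same order of steps (lowercase, tokenize, drop stop words, then pair neighbouring tokens), which is why phrases can join words that were not adjacent in the original title.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; scikit-learn's 'english' list is much longer.
STOP_WORDS = {"the", "in", "is", "a", "of", "for"}


def bigrams(title):
    """Lowercase, keep tokens of 2+ word characters, drop stop words, pair neighbours."""
    tokens = re.findall(r"\b\w\w+\b", title.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Made-up titles, for illustration only.
titles = [
    "Bernie Sanders leads in the Iowa poll",
    "Bernie Sanders rally in New Hampshire",
]
counts = Counter(bg for t in titles for bg in bigrams(t))
print(counts.most_common(3))
```

Note that "leads iowa" is counted as a phrase even though "in the" sat between the two words in the title.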

In [6]:
import pandas as pd

bernie = pd.DataFrame(data['data'])



In [7]:
bernie_df = bernie[['title', 'subreddit']]

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
bernie_words = CountVectorizer(stop_words ='english', ngram_range = (2,2))

X = bernie['title']
y = bernie['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

bernie_words.fit(X_train)



CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(2, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [9]:
bern_words = pd.DataFrame(bernie_words.transform(X_train).todense(), columns = bernie_words.get_feature_names())

##### Top two-word phrases from the Bernie Sanders subreddit

In [10]:
bern_words.sum().sort_values(ascending = False).head(25)


bernie sanders           111
joe biden                 20
new hampshire             11
bernie campaign           10
tom perez                  8
iowa caucus                8
poll sanders               8
days iowa                  8
iowa poll                  8
iowa caucuses              7
poll biden                 7
sanders campaign           7
warren 15                  7
ads bernie                 6
democratic operatives      6
attack ads                 6
sanders iowa               6
social security            6
2008 vs                    6
people perfume             6
super pac                  6
super tuesday              6
support bernie             6
family friends             6
new deal                   5
dtype: int64
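The totals above were obtained by converting the count matrix to a dense dataframe first. A possible alternative, sketched here on made-up titles, sums the sparse matrix directly so that no dense copy is built. The `hasattr` check is there because newer scikit-learn releases expose `get_feature_names_out()` while older ones only have `get_feature_names()`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles, for illustration only.
titles = [
    "Bernie Sanders wins Iowa poll",
    "New Iowa poll shows Bernie Sanders ahead",
    "Bernie Sanders campaign in New Hampshire",
]

vec = CountVectorizer(stop_words="english", ngram_range=(2, 2))
X_counts = vec.fit_transform(titles)  # sparse matrix, one row per title

# Pick whichever feature-name accessor this scikit-learn version provides.
if hasattr(vec, "get_feature_names_out"):
    names = vec.get_feature_names_out()
else:
    names = vec.get_feature_names()

# Column sums of the sparse matrix give the total count of each phrase.
totals = np.asarray(X_counts.sum(axis=0)).ravel()
phrase_counts = {str(name): int(total) for name, total in zip(names, totals)}
top = sorted(phrase_counts.items(), key=lambda pair: -pair[1])[:3]
print(top)
```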

### Joe Biden subreddit

Below I used the same workflow to read in the Joe Biden subreddit and explored the data the same way as I did for the Bernie Sanders data. I wanted to see the top two-word phrases in the Joe Biden titles in order to compare them with those from the Sanders subreddit.

In [11]:
biden_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=JoeBiden&size=1000'

res_2 = requests.get(biden_url)

In [12]:
res_2.status_code

200

In [13]:
biden_data = res_2.json()

In [14]:
biden_data['data'][0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Fiery1Phoenix',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_udkhn',
 'author_patreon_flair': False,
 'author_premium': True,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1580320022,
 'domain': 'projects.fivethirtyeight.com',
 'full_link': 'https://www.reddit.com/r/JoeBiden/comments/evq8xb/biden_currently_has_a_52_chance_of_winning_a/',
 'gildings': {},
 'id': 'evq8xb',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,


### Created a dataframe of Joe Biden subreddit

In [15]:
biden = pd.DataFrame(biden_data['data'])

biden_df = biden[['title', 'subreddit']]

In [16]:
biden_words = CountVectorizer(stop_words ='english', ngram_range = (2,2))

X = biden['title']
y = biden['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

biden_words.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(2, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [17]:
biden_words = pd.DataFrame(biden_words.transform(X_train).todense(), columns = biden_words.get_feature_names())

### Explored the top two-word phrases of the Joe Biden data

In [18]:
biden_words.sum().sort_values(ascending = False).head(25)

joe biden                  251
daily roundtable            34
poll biden                  33
endorses joe                26
biden president             24
roundtable january          24
bernie sanders              22
donald trump                18
biden leads                 15
biden twitter               15
social security             15
democratic presidential     14
national poll               12
south carolina              11
foreign policy              11
new hampshire               10
2020 democratic             10
warren 15                   10
buttigieg bloomberg         10
democratic primary          10
roundtable december         10
vice president              10
biden holds                  9
biden says                   9
poll shows                   9
dtype: int64
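One simple way to compare the two lists is with set operations on the phrases. The sketch below uses only a handful of rows copied from the two outputs above, so it illustrates the comparison rather than giving a full result.

```python
# Counts copied from the two top-25 outputs above (only a few rows of each).
bernie_top = {
    "bernie sanders": 111, "joe biden": 20, "new hampshire": 11,
    "iowa caucus": 8, "social security": 6, "super tuesday": 6,
}
biden_top = {
    "joe biden": 251, "daily roundtable": 34, "bernie sanders": 22,
    "social security": 15, "south carolina": 11, "new hampshire": 10,
}

shared = sorted(set(bernie_top) & set(biden_top))
only_bernie = sorted(set(bernie_top) - set(biden_top))
only_biden = sorted(set(biden_top) - set(bernie_top))

print("shared:", shared)
print("only r/SandersForPresident:", only_bernie)
print("only r/JoeBiden:", only_biden)
```

Phrases that appear in both lists, such as the candidates' names, are less likely to help a classifier than phrases that show up in only one.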

### Combined Bernie and Biden dataframes

I combined the Bernie and Biden dataframes created earlier so that I could build a model that predicts which subreddit each title comes from.

In [19]:
bern_bid = pd.concat([bernie_df, biden_df])

bern_bid

| | title | subreddit |
|---|---|---|
| 0 | My Fundraising Pledge | SandersForPresident |
| 1 | Google, I think your bias is showing | SandersForPresident |
| 2 | New Bernie Sanders Campaign Ad shows how he's ... | SandersForPresident |
| 3 | 🥳🥳🥳🥳 | SandersForPresident |
| 4 | Texas Poll: Trump 50%, Sanders 47% (Better tha... | SandersForPresident |
| ... | ... | ... |
| 995 | Joe Biden unveils new Texas endorsements as he... | JoeBiden |
| 996 | Is Biden unwell? | JoeBiden |
| 997 | Cesar Torres on Twitter: Had the pleasure to s... | JoeBiden |
| 998 | 538 just released their primary polling model | JoeBiden |
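By default `pd.concat` keeps each frame's own row labels, so the combined frame probably contains every label twice (once per subreddit). That does not affect the modeling below, but label-based lookups such as `.loc[0]` would return two rows. A minimal sketch with toy frames standing in for `bernie_df` and `biden_df`:

```python
import pandas as pd

# Two toy frames standing in for bernie_df and biden_df.
left = pd.DataFrame({"title": ["a", "b"], "subreddit": ["SandersForPresident"] * 2})
right = pd.DataFrame({"title": ["c", "d"], "subreddit": ["JoeBiden"] * 2})

stacked = pd.concat([left, right])                        # keeps the original labels
renumbered = pd.concat([left, right], ignore_index=True)  # assigns fresh labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
```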


In [20]:
#Here I needed to change the y variable into an integer in order to set it up for modeling.

bern_bid['subreddit'] = bern_bid['subreddit'].map({'SandersForPresident': 0, 'JoeBiden': 1})
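`Series.map` with a dictionary returns `NaN` for any value that is not a key, which is why a null check after the mapping is worthwhile. The sketch below uses a toy series with one deliberately misspelled value to show how such rows could be found.

```python
import pandas as pd

label_map = {"SandersForPresident": 0, "JoeBiden": 1}

# Toy series; the last value is a deliberate mismatch (wrong capitalisation).
subreddit = pd.Series(["SandersForPresident", "JoeBiden", "joebiden"])

encoded = subreddit.map(label_map)
unmapped = subreddit[encoded.isna()]
print(unmapped.tolist())
```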

In [21]:
bern_bid

| | title | subreddit |
|---|---|---|
| 0 | My Fundraising Pledge | 0 |
| 1 | Google, I think your bias is showing | 0 |
| 2 | New Bernie Sanders Campaign Ad shows how he's ... | 0 |
| 3 | 🥳🥳🥳🥳 | 0 |
| 4 | Texas Poll: Trump 50%, Sanders 47% (Better tha... | 0 |
| ... | ... | ... |
| 995 | Joe Biden unveils new Texas endorsements as he... | 1 |
| 996 | Is Biden unwell? | 1 |
| 997 | Cesar Torres on Twitter: Had the pleasure to s... | 1 |
| 998 | 538 just released their primary polling model | 1 |


In [22]:
bern_bid.isnull().sum()

title        0
subreddit    0
dtype: int64

## Making Logistic Regression, KNeighbors, Multinomial Naive Bayes, and Random Forest Models

In [23]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

In [24]:
X = bern_bid['title']
y = bern_bid['subreddit']

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
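Accuracy scores are easier to read next to a baseline, namely the share of the larger class in the test split. Passing `stratify=y` is an optional extra, not used in the cell above, that keeps the class proportions the same in both splits. The sketch uses toy data with 100 titles per class as an assumption; the real class counts may differ slightly.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 100 titles per class, standing in for the combined frame.
X_toy = [f"title {i}" for i in range(200)]
y_toy = [0] * 100 + [1] * 100

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

test_counts = Counter(y_te)
# Accuracy of always predicting the most common class in the test split.
baseline = max(test_counts.values()) / len(y_te)
print(test_counts, baseline)
```

With two roughly equal classes the baseline sits near 0.5, so any model score should be judged against that figure.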

##### Created a pipeline for Logistic Regression and a parameter grid, reused for every model, in order to find the best settings

In [26]:
pipe_bb = Pipeline([
    ('cvec', CountVectorizer()),
    ('logr', LogisticRegressionCV(solver = 'liblinear', cv = 5))
])

In [27]:
pipe_params_bb = {
    'cvec__max_features': [100, 200, 300, 400, 500],
    'cvec__max_df': [.98, .95],
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1)],
    'cvec__stop_words': ['english']
}
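To get a feel for how much work the grid search does, the combinations in the grid above can be enumerated with the standard library. This is only a counting sketch; `GridSearchCV` does the enumeration itself.

```python
from itertools import product

pipe_params_bb = {
    "cvec__max_features": [100, 200, 300, 400, 500],
    "cvec__max_df": [0.98, 0.95],
    "cvec__min_df": [2, 3],
    "cvec__ngram_range": [(1, 1)],
    "cvec__stop_words": ["english"],
}

# One dictionary per combination of parameter values.
candidates = [
    dict(zip(pipe_params_bb, combo))
    for combo in product(*pipe_params_bb.values())
]
n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

For the logistic regression pipeline each of those fits also runs the inner 5-fold search of `LogisticRegressionCV`, so that model takes noticeably longer than the others.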

##### Below is the GridSearch for the Logistic Regression, and scores for the train and testing data

In [28]:
gs_bb = GridSearchCV(pipe_bb, pipe_params_bb, cv = 5)

gs_bb.fit(X_train, y_train)

print(gs_bb.score(X_train, y_train))
print(gs_bb.score(X_test, y_test))

0.9
0.876


In [29]:
gs_bb.best_params_

{'cvec__max_df': 0.98,
 'cvec__max_features': 400,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english'}
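A fitted logistic regression pipeline can also show which words push a title towards each subreddit. The sketch below is an assumption-laden toy: the titles are made up, and it uses plain `LogisticRegression` instead of `LogisticRegressionCV` to keep the example small. The labels follow the mapping used earlier (0 for SandersForPresident, 1 for JoeBiden).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up titles; 0 = SandersForPresident, 1 = JoeBiden.
titles = [
    "bernie rally tonight", "bernie leads poll", "support bernie",
    "biden leads poll", "biden rally tonight", "endorses biden",
]
labels = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logr", LogisticRegression(solver="liblinear")),
])
pipe.fit(titles, labels)

# vocabulary_ maps each word to its column, which lines up with coef_.
cvec = pipe.named_steps["cvec"]
coefs = pipe.named_steps["logr"].coef_[0]
weights = {word: float(coefs[idx]) for word, idx in cvec.vocabulary_.items()}

# Negative weights point towards class 0, positive weights towards class 1.
for word in sorted(weights, key=weights.get):
    print(f"{word:10s} {weights[word]: .3f}")
```

After a grid search, the fitted pipeline is available as `gs_bb.best_estimator_`, and the same `named_steps` lookups apply to it.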

##### Below is the KNeighbors model along with the scores for the model

In [30]:
from sklearn.neighbors import KNeighborsClassifier

In [31]:
pipe_bb_knn = Pipeline([
    ('cvec', CountVectorizer()),
    ('knn', KNeighborsClassifier())
])

In [32]:
gs_knn = GridSearchCV(pipe_bb_knn, pipe_params_bb, cv = 5)

In [33]:
gs_knn.fit(X_train, y_train);

In [34]:
gs_knn.best_params_

{'cvec__max_df': 0.98,
 'cvec__max_features': 200,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english'}

In [35]:
gs_knn.best_score_

0.8486666666666667

In [36]:
gs_knn.score(X_test, y_test)

0.856

##### Below is the pipeline for the Multinomial Naive Bayes model along with the scores for the model

In [37]:
pipe_bb_mnb = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

gs_mnb = GridSearchCV(pipe_bb_mnb, pipe_params_bb, cv = 5)

gs_mnb.fit(X_train, y_train);

In [38]:
gs_mnb.best_params_

{'cvec__max_df': 0.98,
 'cvec__max_features': 500,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english'}

In [39]:
gs_mnb.best_score_

0.8273333333333334

In [40]:
gs_mnb.score(X_test, y_test)

0.83

##### Below is the pipeline for the Random Forest along with the scores for the model

In [41]:
from sklearn.ensemble import RandomForestClassifier

In [52]:
pipe_bb_rf = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators = 100))
])

gs_rf = GridSearchCV(pipe_bb_rf, pipe_params_bb, cv = 5)

gs_rf.fit(X_train, y_train);

In [53]:
gs_rf.score(X_train, y_train)

0.9793333333333333

In [54]:
gs_rf.score(X_test, y_test)

0.88

In [55]:
gs_rf.best_params_

{'cvec__max_df': 0.95,
 'cvec__max_features': 500,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english'}
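The scores reported above can be lined up in a few lines of Python. The numbers are copied from the cell outputs. Training scores were only printed for logistic regression and random forest, so the train/test gap is computed for those two alone (the KNeighbors and Multinomial Naive Bayes cells report `best_score_`, a cross-validation score, instead).

```python
# Test-set accuracy copied from the cell outputs above.
test_scores = {
    "logistic_regression": 0.876,
    "knn": 0.856,
    "multinomial_nb": 0.83,
    "random_forest": 0.88,
}
# Training accuracy, available only for these two models.
train_scores = {
    "logistic_regression": 0.9,
    "random_forest": 0.9793333333333333,
}

best = max(test_scores, key=test_scores.get)
gaps = {
    name: round(train_scores[name] - test_scores[name], 3)
    for name in train_scores
}

print(best)
print(gaps)
```

Random forest has the highest test score by a small margin, but its train/test gap is several times larger than that of logistic regression, which suggests it overfits the training titles more.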