# Using Reddit's API to Predict Classes

# Executive Summary

#### In this notebook, I aim to answer the question: what characteristics of a Reddit post contribute most to identifying the subreddit it belongs to?
#### Specifically, when comparing two subreddits, which words in a post's title indicate which subreddit it came from?

#### The two subreddits I chose to compare are the NBA (National Basketball Association) and the 76ers (a team in the NBA). I chose this pair because it makes a good edge case: both subreddits cover similar subject matter, which could reduce accuracy, so it is interesting to see how well the models perform.

#### To start answering this question, we must first gather the data from each subreddit. To do this, we use the requests and json libraries in Python to extract data from each subreddit's JSON pages. In this part of the project I explored the JSON to find where the information I wanted was located. Once I determined that, I requested posts from Reddit and got around 800 posts for each subreddit. After exploring the data further, I decided to use the title as the basis for my classification, since many posts had no selftext.

#### From there I used a Count Vectorizer and a TFIDF Vectorizer to transform the titles into easily digestible features for my models. For my targets, I made the 76ers a 1 and the NBA a 0. The models I decided to go with were Logistic Regression, Random Forest Classifier, and Multinomial Naive Bayes.

#### After fitting Logistic Regression, I looked at the words with the largest coefficients, and they turned out to be mostly names of famous NBA players. The NBA coefficients listed players who are famous league-wide, while the 76ers coefficients were dominated by the team's star players.

#### Overall, all the models performed similarly: they overfit the training data (0.97 to 0.99 training accuracy) and scored lower on the test set (0.76 to 0.82). Multinomial Naive Bayes performed the best at about 0.82 test accuracy, but it wasn't much better than the others.

## Data Science Process

<table style="width:25%" align="left">
  <tr>
    <th style="text-align:left">1. Define the problem.</th>
  </tr>
  <tr>
    <th style="text-align:left">2. Gather the data.</th>
  </tr>
  <tr>
    <th style="text-align:left">3. Explore the data.</th>
  </tr>
  <tr>
    <th style="text-align:left">4. Model the data.</th>
  </tr>
  <tr>
    <th style="text-align:left">5. Evaluate the model.</th>
  </tr> 
  <tr>
    <th style="text-align:left">6. Answer the problem.</th>
  </tr> 
</table>


# <u>1. Define the Problem :</u>
### What characteristics of a post on Reddit contribute most to what subreddit it belongs to?
#### Specifically, I'll be looking at which words in the title contribute most to identifying which subreddit a post belongs to.
#### I did not use self-text, since many of the posts didn't have any.


====================================================================================================================

# <u> 2. Gather the Data</u>
#### Here I'll webscrape posts from two subreddits to later process and create models from.
#### The two subreddits I'll be comparing in this iteration are the <u>NBA and 76ers</u> subreddits (76ers are a basketball team in the NBA)

====================================================================================================================

**Import Libraries**

In [98]:
import time
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import stop_words



**Function for webscraping from Reddit**

In [33]:
def get_data(User_agent, num_of_posts, url, next_post="after"):

    #User_agent - user agent string sent with each request
    #num_of_posts - number of requests to make (each request returns one page of posts)
    #url - url that we want to scrape from, should have .json at the end of it
    #next_post - page forward ('after') or backward ('before') through the listing

    #list of all posts
    all_posts = []
    #query parameters; updated with the pagination token after each request
    p = {}

    for i in range(num_of_posts):
        #get request
        res = requests.get(url, params=p, headers={"User-agent": User_agent})
        #raise an exception if the request failed (4xx/5xx status)
        res.raise_for_status()
        #parse the json body
        data = res.json()

        #get posts for this request
        list_of_posts = data["data"]["children"]

        #add to all_posts
        all_posts.extend(list_of_posts)

        #pagination token pointing at the next page in the chosen direction
        token = data["data"][next_post]

        #if there is no token, there are no more pages
        if token is None:
            print("No more posts")
            break

        #otherwise send the token with the next request
        p.update({next_post: token})
        print("The current", next_post, ":", token, " ", i, ": size: ", len(list_of_posts))
        #time.sleep(1)  #uncomment to pause between requests

    #Create DataFrame from the per-post dicts so that every value lands in the
    #column named by its key, even if some posts have extra or missing keys
    data_df = pd.DataFrame([post["data"] for post in all_posts])
    return data_df
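
The pagination logic in `get_data` can be tried without touching the network. The sketch below is not the notebook's code: `paginate`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the two fake pages only imitate the `data` / `children` / `after` shape that the function above reads.

```python
def paginate(fetch_page, max_requests, direction="after"):
    """Collect posts by following Reddit-style pagination tokens.

    fetch_page is a hypothetical stand-in for requests.get(...).json():
    it takes the query-parameter dict and returns a listing-shaped dict.
    """
    posts, params = [], {}
    for _ in range(max_requests):
        listing = fetch_page(params)["data"]
        posts.extend(child["data"] for child in listing["children"])
        token = listing[direction]
        if token is None:  # no further pages in this direction
            break
        params = {direction: token}
    return posts


# Fake two-page listing, used only to exercise the loop offline.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"title": "a"}}, {"data": {"title": "b"}}],
                    "after": "t3_b", "before": None}},
    "t3_b": {"data": {"children": [{"data": {"title": "c"}}],
                      "after": None, "before": "t3_a"}},
}


def fake_fetch(params):
    # Look up the page that the "after" token points to (None = first page).
    return FAKE_PAGES[params.get("after")]


titles = [post["title"] for post in paginate(fake_fetch, max_requests=5)]
print(titles)
```

The loop stops as soon as the token is `None`, so asking for more requests than there are pages is harmless.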

**Get data from the sixers and NBA subreddits to compare**

In [34]:
#sixers = pd.DataFrame(get_data("Bobby",40,"https://www.reddit.com/r/sixers.json"))
#nba = pd.DataFrame(get_data("Bobby",40,"https://www.reddit.com/r/nba.json"))

**Export both DataFrames to a csv so I don't have to scrape again later**

In [35]:
# Export both to csv
# sixers.to_csv('both.csv')

# with open('both.csv', 'a',encoding='utf-8') as f:
#     nba.to_csv(f)

# #Export each dataframe individually just in case
# nba.to_csv('nba.csv')
# sixers.to_csv('sixers.csv')

In [36]:
both = pd.read_csv("both.csv").copy()

In [37]:
nba = both[both["subreddit"]=='nba'].copy()
sixers = both[both["subreddit"]=='sixers'].copy()
nba.reset_index(inplace = True)
sixers.reset_index(inplace = True)

In [38]:
sixers.drop(labels = ["index","Unnamed: 0"], axis = 1,inplace = True)
nba.drop(labels = ["index","Unnamed: 0"], axis = 1,inplace = True)

# 3. Explore the Data

In [39]:
sixers.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,crosspost_parent,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,media,is_video
0,,sixers,It's that time. The off-season clock appears t...,t2_fcq7v,False,,0,False,[Off-Season Thread] Waitin' for training camp...,[],...,https://www.reddit.com/r/sixers/comments/91uj4...,42353,1532543892.0,,False,,,,,
1,,sixers,,t2_txalw,False,,0,False,Ben with LeBron.,[],...,all_ads,False,https://www.instagram.com/p/BnWVp7wn0d_/?taken...,42353.0,1536160451.0,,False,,,
2,,sixers,,t2_kaek1sc,False,,0,False,#HereTheyCome Kelle hitting the gym! F2G,[],...,all_ads,False,https://i.redd.it/mp7g7g7tigk11.jpg,42353.0,1536169408.0,,False,,,
3,,sixers,,t2_12vc6g,False,,0,False,One of the trainers that is studying under Han...,[],...,all_ads,False,https://i.redd.it/z205by4ymhk11.jpg,42353.0,1536182917.0,,False,,,
4,,sixers,,t2_swp0e,False,,0,False,14$ well spent on my first jersey,[],...,all_ads,False,https://i.redd.it/h8nk24d0ygk11.jpg,42353.0,1536174546.0,,False,,,


In [40]:
nba.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,crosspost_parent,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,media,is_video
0,,nba,#[/r/NBA Rules](https://www.reddit.com/r/nba/w...,t2_6l4z3,,,0,False,Daily Locker Room and Free Talk + Game Threads...,[],...,1536153318.0,,False,,,,,,,
1,,nba,,t2_hn4t5,,,0,False,r/nba Best of August,[],...,1535901297.0,{'oembed': {'provider_url': 'http://imgur.com'...,False,,,,,,,
2,,nba,,t2_kk5v7,Misc. Media,,0,False,"New Nike commercial featuring Lebron, Kaeperni...",[],...,1536175454.0,{'oembed': {'provider_url': 'https://streamabl...,False,,,,,,,
3,,nba,,t2_ao907,,,0,False,LeBron James says he 'stands with Nike' in ref...,[],...,1536151457.0,,False,,,,,,,
4,,nba,,t2_nqx3n,Highlights,,0,False,Chris Paul taps D-Rose on the left side twice ...,[],...,1536177024.0,"{'type': 'streamable.com', 'oembed': {'provide...",False,,,,,,,


In [77]:
print("Sixers Null values - selftext, title : ",
sixers["selftext"].isna().sum() ,",", sixers["title"].isna().sum())
print("NBA Null values - selftext, title : ",
nba["selftext"].isna().sum() ,",", nba["title"].isna().sum())

Sixers Null values - selftext, title :  680 , 0
NBA Null values - selftext, title :  314 , 0


In [83]:
print("sixers rows: ",sixers.shape[0])
print("nba rows: ", nba.shape[0])

sixers rows:  816
nba rows:  778


**>>>> A large share of posts in both subreddits have no self-text (680 of 816 sixers posts and 314 of 778 NBA posts are null), so I decided not to use it.**
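
To put those null counts on a common scale, the short sketch below turns them into shares of each subreddit's posts. The numbers are copied from the output cells above; the dictionary names are only illustrative.

```python
# Counts copied from the null-value and row-count outputs above.
null_selftext = {"sixers": 680, "nba": 314}
total_rows = {"sixers": 816, "nba": 778}

shares = {name: null_selftext[name] / total_rows[name] for name in total_rows}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of posts have no self-text")

# With the DataFrames loaded, the same ratios would come from
# sixers["selftext"].isna().mean() and nba["selftext"].isna().mean().
```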

# 4. Model the Data

# <u> NLP:</u> 
# Logistic Regression , Random Forest Classifier, Multinomial  Naive Bayes 

============================================================================================================================

**Load features and target and do a train-test split**

In [43]:
X = both.title
# Target: 1 for the 76ers subreddit, 0 for the NBA subreddit
y = (both["subreddit"] == "sixers").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X,y)
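
The split above uses the defaults, so the rows that land in the test set change on every run. A minimal sketch of a reproducible, class-balanced split follows. It assumes made-up data (24 placeholder titles), and `stratify` and `random_state` are additions that the notebook itself does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 12 "sixers" rows followed by 12 "nba" rows.
toy = pd.DataFrame({
    "title": [f"post {i}" for i in range(24)],
    "subreddit": ["sixers"] * 12 + ["nba"] * 12,
})
X_toy = toy["title"]
y_toy = (toy["subreddit"] == "sixers").astype(int)

# stratify keeps the class ratio the same in both splits;
# random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)
print(len(X_tr), len(X_te), int(y_te.sum()))
```

Fixing the seed would also make the scores in the cells below repeatable across runs.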

In [44]:
#Function for returning the coefficients for each feature, sorted

def coef_name_values(model):
    #Get the feature names (words) and coefficients for each one
    vect_names = model.steps[0][1].get_feature_names()
    logis_coef = model.steps[1][1].coef_

    #Put them into a DataFrame
    coefs_of_names = pd.DataFrame(data = logis_coef, columns=vect_names)

    #Sort the coefs from most negative (NBA, class 0) to most positive (76ers, class 1)
    print("Sorted Coefficent Values")
    return coefs_of_names.sort_values(by =0,axis = 1)
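
Because the coefficients are sorted, both ends of the list matter: the most negative words lean toward class 0 and the most positive toward class 1. The sketch below shows one way to read both ends from a fitted pipeline. It uses a handful of made-up titles, not the scraped data, and the step names `vect` and `log` are illustrative. The `getattr` fallback is there because newer scikit-learn releases expose `get_feature_names_out` while older ones only have `get_feature_names`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up titles purely for illustration; 1 = sixers, 0 = nba.
toy_titles = [
    "embiid dominates tonight", "embiid and simmons highlights",
    "simmons triple double", "embiid injury update",
    "lebron signs with lakers", "curry hits ten threes",
    "lebron and curry all star", "curry mvp talk",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("log", LogisticRegression()),
]).fit(toy_titles, toy_labels)

vect = toy_model.named_steps["vect"]
# Use whichever feature-name accessor this scikit-learn version provides.
get_names = getattr(vect, "get_feature_names_out", None) or vect.get_feature_names
coefs = pd.Series(
    toy_model.named_steps["log"].coef_[0], index=list(get_names())
).sort_values()

print("Most NBA-leaning words:", list(coefs.index[:3]))
print("Most 76ers-leaning words:", list(coefs.index[-3:]))
```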
    

**Use Pipelines for Logistic Regression**

In [45]:
custom_stopwords = list(stop_words.ENGLISH_STOP_WORDS)
custom_stopwords.extend(['nba','sixers'])
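
The subreddit names are added to the stop words so the models cannot simply key on them. A small sketch of the effect on one made-up title is below. It imports `ENGLISH_STOP_WORDS` from `sklearn.feature_extraction.text`, which is an assumption about import path made for this example rather than the import used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Standard English stop words plus the two subreddit names.
custom = list(ENGLISH_STOP_WORDS) + ["nba", "sixers"]

vect = CountVectorizer(stop_words=custom)
vect.fit(["The Sixers are in the NBA playoffs with Embiid"])

# vocabulary_ maps each surviving token to its column index.
vocab = sorted(vect.vocabulary_)
print(vocab)
```

Tokens are lowercased before the stop-word filter runs, which is why the lowercase entries `nba` and `sixers` also remove "NBA" and "Sixers".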

In [62]:
#Count Vectorizer
steps1 = [
    ('Count_Vectorize', CountVectorizer(stop_words = custom_stopwords)),
    ('Log', LogisticRegression())
]

model_log = Pipeline(steps1)
model_log.fit(X_train,y_train)
model_log.score(X_train,y_train),model_log.score(X_test,y_test)

(0.9949832775919732, 0.7944862155388471)

In [54]:
coef_name_values(model_log).T

Sorted Coefficent Values


Unnamed: 0,0
players,-1.861337
lebron,-1.210550
deng,-1.204360
kobe,-1.169204
career,-1.147953
curry,-1.077196
playoffs,-1.072901
klay,-1.035761
jordan,-1.029370
durant,-1.021310


In [60]:
#TFIDF Vectorizer
steps1 = [
    ('TF_Vectorize', TfidfVectorizer(stop_words = custom_stopwords,)),
    ('Log', LogisticRegression())
]

model_log1 = Pipeline(steps1)
model_log1.fit(X_train,y_train)
model_log1.score(X_train,y_train),model_log1.score(X_test,y_test)

(0.9690635451505016, 0.7969924812030075)

In [49]:
coef_name_values(model_log1).T

Sorted Coefficent Values


Unnamed: 0,0
players,-2.357845
lebron,-1.755325
kobe,-1.377357
player,-1.252106
team,-1.211135
curry,-1.147751
deng,-1.115470
durant,-1.114854
career,-1.030400
kevin,-1.024363


**Use Pipelines for a Random Forest Classifier**

In [55]:
steps2 = [
    ('Count_Vectorize', CountVectorizer(stop_words = custom_stopwords)),
    ('RFC', RandomForestClassifier())
]

model_rfc = Pipeline(steps2)
model_rfc.fit(X_train,y_train)
model_rfc.score(X_train,y_train),model_rfc.score(X_test,y_test)

(0.9891304347826086, 0.7619047619047619)

In [66]:
steps2 = [
    ('TF_Vectorize', TfidfVectorizer(stop_words = stop_words.ENGLISH_STOP_WORDS)),
    ('RFC', RandomForestClassifier())
]

model_rfc1 = Pipeline(steps2)
model_rfc1.fit(X_train,y_train)
model_rfc1.score(X_train,y_train), model_rfc1.score(X_test,y_test)

(0.9899665551839465, 0.7919799498746867)

**Use Pipelines for Multinomial Naive Bayes**

In [67]:
steps3 = [
   ('Count_Vectorize', CountVectorizer(stop_words = custom_stopwords)),
    ('MultiNomialNB', MultinomialNB())
]

model_nb = Pipeline(steps3)
model_nb.fit(X_train,y_train)
model_nb.score(X_train,y_train),model_nb.score(X_test,y_test)

(0.9665551839464883, 0.8170426065162907)

In [68]:
steps3 = [
    ('TF_Vectorize', TfidfVectorizer(stop_words = custom_stopwords)),
    ('MultiNomialNB', MultinomialNB())
]



model_nb1 = Pipeline(steps3)
model_nb1.fit(X_train,y_train)
model_nb1.score(X_train,y_train), model_nb1.score(X_test,y_test)

(0.9749163879598662, 0.8170426065162907)
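
All of the pipelines above use default hyperparameters and score far higher on the training data than on the test data. `GridSearchCV` is imported at the top of the notebook but never used; the sketch below shows how it could tune a pipeline with cross-validation. The titles are made up, and the grid values (`min_df`, `alpha`) are illustrative guesses rather than settings taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up titles; 1 = sixers, 0 = nba.
toy_titles = [
    "embiid dominates", "embiid highlights", "embiid injury update",
    "embiid and simmons", "embiid post game", "embiid dunk",
    "lebron signs", "lebron highlights", "lebron and curry",
    "lebron post game", "lebron dunk", "lebron mvp talk",
]
toy_labels = [1] * 6 + [0] * 6

pipe = Pipeline([("vect", TfidfVectorizer()), ("nb", MultinomialNB())])

# Keys follow the "<step name>__<parameter>" convention for pipelines.
param_grid = {
    "vect__min_df": [1, 2],   # drop words seen in fewer than min_df titles
    "nb__alpha": [0.1, 1.0],  # smoothing strength for Naive Bayes
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy_titles, toy_labels)
print(search.best_params_, search.best_score_)
```

On the real data the same pattern would be fit on `X_train, y_train` only, keeping the test split for a final check.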

# 5. Evaluate the Models

In [127]:
labels = ["NBA","76ers"]
def print_cm(confusion_mat):
    confusion_mat.columns = ["Predicted NBA","Predicted 76ers"] 
    confusion_mat.index = ["True NBA" ,"True 76ers"]
    return confusion_mat

In [128]:
y_pred_log = model_log.predict(X_test)
y_pred_log1 = model_log1.predict(X_test)
y_pred_rfc = model_rfc.predict(X_test)
y_pred_rfc1 = model_rfc1.predict(X_test)
y_pred_nb = model_nb.predict(X_test)
y_pred_nb1 = model_nb1.predict(X_test)


In [133]:
#Logistic Regression with CountVec
print(classification_report(y_test, y_pred_log))
cm_log = pd.DataFrame(confusion_matrix(y_test, y_pred_log))
print_cm(cm_log)

             precision    recall  f1-score   support

          0       0.80      0.76      0.78       189
          1       0.79      0.83      0.81       210

avg / total       0.79      0.79      0.79       399



Unnamed: 0,Predicted NBA,Predicted 76ers
True NBA,143,46
True 76ers,36,174


In [134]:
#Logistic Regression with TFIDFVec
print(classification_report(y_test, y_pred_log1))
cm_log1 = pd.DataFrame(confusion_matrix(y_test, y_pred_log1))
print_cm(cm_log1)

             precision    recall  f1-score   support

          0       0.78      0.80      0.79       189
          1       0.81      0.80      0.80       210

avg / total       0.80      0.80      0.80       399



Unnamed: 0,Predicted NBA,Predicted 76ers
True NBA,151,38
True 76ers,43,167


In [132]:
#Random Forest with CountVec
print(classification_report(y_test, y_pred_rfc))
cm_rfc = pd.DataFrame(confusion_matrix(y_test, y_pred_rfc))
print_cm(cm_rfc)

Unnamed: 0,Predicted NBA,Predicted 76ers
True NBA,120,69
True 76ers,26,184


In [135]:
#RandomForest with TFIDF
print(classification_report(y_test, y_pred_rfc1))
cm_rfc1 = pd.DataFrame(confusion_matrix(y_test, y_pred_rfc1))
print_cm(cm_rfc1)

             precision    recall  f1-score   support

          0       0.83      0.70      0.76       189
          1       0.77      0.87      0.82       210

avg / total       0.80      0.79      0.79       399



Unnamed: 0,Predicted NBA,Predicted 76ers
True NBA,133,56
True 76ers,27,183


In [136]:
#Multinomial NB with CountVec
print(classification_report(y_test, y_pred_nb))
cm_nb = pd.DataFrame(confusion_matrix(y_test, y_pred_nb))
print_cm(cm_nb)

             precision    recall  f1-score   support

          0       0.82      0.78      0.80       189
          1       0.81      0.85      0.83       210

avg / total       0.82      0.82      0.82       399



Unnamed: 0,Predicted NBA,Predicted 76ers
True NBA,148,41
True 76ers,32,178


In [137]:
#Multinomial NB with TFIDF
print(classification_report(y_test, y_pred_nb1))
cm_nb1 = pd.DataFrame(confusion_matrix(y_test, y_pred_nb1))
print_cm(cm_nb1)

             precision    recall  f1-score   support

          0       0.83      0.77      0.80       189
          1       0.81      0.86      0.83       210

avg / total       0.82      0.82      0.82       399



Unnamed: 0,Predicted NBA,Predicted 76ers
True NBA,146,43
True 76ers,30,180
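
To compare the six models side by side, the sketch below collects the counts from the confusion matrices in this section into one table and derives test accuracy for each. The counts are copied from the outputs above; the row labels and the `summary` name are illustrative.

```python
import pandas as pd

# (TN, FP, FN, TP) copied from the six confusion matrices in this section.
matrices = {
    "LogReg + Count": (143, 46, 36, 174),
    "LogReg + TFIDF": (151, 38, 43, 167),
    "RandomForest + Count": (120, 69, 26, 184),
    "RandomForest + TFIDF": (133, 56, 27, 183),
    "MultinomialNB + Count": (148, 41, 32, 178),
    "MultinomialNB + TFIDF": (146, 43, 30, 180),
}

summary = pd.DataFrame.from_dict(
    matrices, orient="index", columns=["tn", "fp", "fn", "tp"]
)
# Accuracy = correct predictions / all test rows.
total = summary[["tn", "fp", "fn", "tp"]].sum(axis=1)
summary["accuracy"] = (summary["tn"] + summary["tp"]) / total

print(summary.sort_values("accuracy", ascending=False).round(4))
```

Both Multinomial Naive Bayes variants get 326 of 399 test posts right, which is the small lead described in the executive summary.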


# 6. Answer the Problem
**What characteristics of a post on Reddit contribute most to what subreddit it belongs to?**


I specifically looked at which words in the title contributed to its classification. It turns out that the names of players who are big in each fandom were the words that most determined the classification. Words like Curry, LeBron, and Kobe pushed the classifier toward the NBA subreddit, while words like Embiid, Joel, and Ben identified a post as belonging to the 76ers subreddit.