## Ah, Reddit

Reddit calls itself "the front page of the internet", and rightly so: you can find posts about almost anything over there. Every community on Reddit is known as a "subreddit", and its users are called "redditors".

## About r/india

This is the official subreddit of everything about India. 

## Flairs

Flairs are best thought of as subtopics within a subreddit, and they are set by the moderators of that subreddit. I will gradually show what they look like in this notebook.


# The toolkit

In [None]:
#The baseline modules
import numpy as np
import pandas as pd

#For text cleaning
import spacy

#For plotting
import missingno as msno
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

#Model packages
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

#Pipeline, Vectorizers and accuracy metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
df = pd.read_csv('../input/reddit-india-flair-detection/datafinal.csv', index_col='Unnamed: 0')
df.head()

# Null values

In [None]:
df.isnull().sum().any()

In [None]:
df.isnull().sum()

In [None]:
msno.matrix(df)
plt.show()
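
For a numeric view alongside the missingno matrix, the per-column counts can be turned into percentages. This is only a sketch on a made-up toy frame (`toy` and its values are assumptions, not the real dataset); the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "body": ["x", np.nan, np.nan, "y"],
    "flair": ["Politics", "Sports", np.nan, "Food"],
})

# Count and percentage of missing values per column, worst column first.
null_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
}).sort_values("missing", ascending=False)

print(null_summary)
```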

# Cleaning Data

## Removing unnecessary columns

In [None]:
df.columns

Columns that can be removed:

* score - Karma (basically the number of upvotes a post gets) doesn't help in figuring out a flair
* url - We can't do much with the URL of the post
* comms_num - The number of comments isn't that important
* timestamp - The timestamp doesn't factor into predicting the flair
* author - We can't decide the flair based on who wrote the post

In [None]:
df.drop(['score','url','comms_num','author','timestamp'], axis=1, inplace=True)
df.head()

In [None]:
df['title'][0]

In [None]:
df['body'][0]

In [None]:
df['comments'][0]

In [None]:
df['combined_features'][0]

Well, will you look at that. The combined_features column is just the combination of title, body and comments, so that's safe to drop as well.

In [None]:
df.drop(['combined_features'], axis=1, inplace=True)
df.head()

**Time to explore the flairs, because that's our target to predict**

In [None]:
df.info()

In [None]:
df.describe()

How many flairs are present?

In [None]:
df['flair'].unique()

In [None]:
df.groupby('flair')['title'].describe()

In [None]:
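# groupby sorts the flairs, so take the labels from the grouped result itself rather than
# from df['flair'].unique() (order of appearance). count() = non-null titles per flair.
fla_df = df.groupby('flair')['title'].count().rename_axis('Flair').reset_index(name='Number')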

fig = px.bar(fla_df, x='Flair', y='Number', title='Flair Counts by Title in r/india')
fig.show()

The same can be done for body and comments.

In [None]:
# Non-null bodies per flair, with labels taken from the grouped result so they stay aligned
fla_df_1 = df.groupby('flair')['body'].count().rename_axis('Flair').reset_index(name='Number')

fig = px.bar(fla_df_1, x='Flair', y='Number', title='Flair Counts by body in r/india')
fig.show()

In [None]:
# Non-null comments per flair, with labels taken from the grouped result so they stay aligned
fla_df_2 = df.groupby('flair')['comments'].count().rename_axis('Flair').reset_index(name='Number')

fig = px.bar(fla_df_2, x='Flair', y='Number', title='Flair Counts by comments in r/india')
fig.show()
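
If all that is needed is the number of posts per flair, `value_counts` gives labels and counts together, so they cannot drift out of alignment. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical posts; only the flair column matters here.
toy = pd.DataFrame({
    "flair": ["Politics", "Food", "Politics", "Sports", "Politics", "Food"],
})

# One row per flair, most frequent first.
flair_counts = (
    toy["flair"]
    .value_counts()
    .rename_axis("Flair")
    .reset_index(name="Number")
)

print(flair_counts)
```

The resulting frame has the same `Flair` and `Number` columns as the ones plotted with `px.bar` in this section.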

In [None]:
df[df['flair'].isna()].describe()  # NaN never compares equal to itself, so test with isna()

In [None]:
df.dropna(subset=['flair'], inplace=True)
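
A note on checking for missing flairs: NaN never compares equal to anything, including itself, so an equality test against `np.nan` selects no rows, while `isna()` finds them. A small illustration on a made-up column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"flair": ["Politics", np.nan, "Food"]})

# Equality against NaN is False for every row, even the missing one.
eq_mask = toy["flair"] == np.nan
# isna() is the reliable test for missing values.
isna_mask = toy["flair"].isna()

n_eq = int(eq_mask.sum())
n_isna = int(isna_mask.sum())
print(n_eq, n_isna)
```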

In [None]:
df.dtypes

We can combine title, body and comments into a single column called text 

In [None]:
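# Fill NaN first so it doesn't become the string 'nan', then join with spaces
# so the last word of one column doesn't fuse with the first word of the next
df['text'] = df[['title', 'body', 'comments']].fillna('').astype(str).agg(' '.join, axis=1)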
df.drop(['title', 'body', 'comments'], axis=1, inplace=True)
df.head()

Now to clean the text

# Normalisation

In [None]:
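# The 'en' shortcut was removed in spaCy 3; it used to point to the small English model
nlp = spacy.load('en_core_web_sm')

def normalize(msg):
    """Return the lowercased lemmas of the tokens in msg that pass the filter below."""
    doc = nlp(msg)
    res = []

    for token in doc:
        # Skip stopwords and punctuation. NOTE: `not token.is_oov` skips tokens that are
        # in the vocabulary, so only tokens flagged as out-of-vocabulary are kept. In recent
        # spaCy versions a model without word vectors flags every token as OOV, so with the
        # small model this clause filters nothing.
        if token.is_stop or token.is_punct or not token.is_oov:
            continue
        res.append(token.lemma_.lower())

    return " ".join(res)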

In [None]:
df['text'] = df['text'].apply(normalize)
df.head()

# Model Training and Prediction

In [None]:
c = TfidfVectorizer() # Convert our strings to numerical values
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
mat = pd.DataFrame(c.fit_transform(df["text"]).toarray(), columns=c.get_feature_names_out())
mat

In [None]:
X = mat
y = df["flair"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Gradient Boosting Classifier takes the crown with an accuracy of 81.14%, but since it takes too long to train we'll go with the Decision Tree Classifier (the comparison is in the Finding Best Classifier section at the end).

# Final Model

In [None]:
pipeline = Pipeline([
    ('classifier',DecisionTreeClassifier()),
    ])

pipeline.fit(X_train, y_train)
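
The pipeline above wraps only the classifier, with TF-IDF fitted beforehand on the whole dataset. As a sketch of an alternative (not what this notebook does), the vectoriser can be a pipeline step too, so that it is fitted on the training texts only and raw strings can be passed straight to `predict`. The texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training posts and their flairs.
train_texts = [
    "cricket match score", "team wins cricket final",
    "election results announced", "parliament passes bill",
    "street food stall review", "best biryani recipe",
]
train_flairs = ["Sports", "Sports", "Politics", "Politics", "Food", "Food"]

text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                           # text -> TF-IDF features
    ("classifier", DecisionTreeClassifier(random_state=0)),  # features -> flair
])
text_pipeline.fit(train_texts, train_flairs)

# Raw text goes in; the pipeline vectorises it with the fitted vocabulary.
pred = text_pipeline.predict(["cricket match tonight"])
print(pred[0])
```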

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
print("Accuracy: {:.2f} %".format(accuracy_score(y_test, y_pred)*100))
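
The raw confusion matrix is easier to read with flair names on both axes. Here is a small worked example on made-up labels (`y_true_toy` and `y_pred_toy` are assumptions); rows are the true flairs and columns the predicted ones.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted flairs for five posts.
y_true_toy = ["Food", "Food", "Politics", "Politics", "Sports"]
y_pred_toy = ["Food", "Politics", "Politics", "Politics", "Sports"]
flairs = ["Food", "Politics", "Sports"]

# Fix the label order explicitly so rows and columns are predictable.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=flairs)
cm_df = pd.DataFrame(cm, index=flairs, columns=flairs)
acc = accuracy_score(y_true_toy, y_pred_toy)

print(cm_df)
print("Accuracy: {:.2f} %".format(acc * 100))
```

One Food post is misread as Politics, which shows up as the off-diagonal 1 in the Food row; the other four predictions are correct, giving 80 % accuracy.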

# Final Output

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['flair'], test_size = 0.2, random_state = 0)

In [None]:
# X_test.index holds index labels, not row positions, so look the ids up with .loc
ids = df.loc[X_test.index, 'id']
final_df = pd.DataFrame({"ID":ids, "Text":X_test, "Flair":y_pred}).reset_index()

final_df.head()
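
One thing to keep in mind when looking ids up from `X_test.index`: once rows have been dropped, index labels no longer match row positions, so `.loc` (labels) and `.iloc` (positions) can return different rows. A tiny illustration with made-up ids:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["p0", "p1", "p2", "p3"],
    "flair": ["Food", np.nan, "Sports", "Politics"],
})
toy = toy.dropna(subset=["flair"])   # remaining index labels: 0, 2, 3

by_label = toy.loc[2, "id"]          # row whose label is 2
by_position = toy.iloc[2]["id"]      # third remaining row
print(by_label, by_position)
```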

In [None]:
final_df.to_csv('./test.csv')

# Finding Best Classifier

These cells were used to find out which classifier performs best. They take a VERY long time to run, so they are disabled by wrapping them in triple-quoted strings.

In [None]:
'''classifiers = {
    'mnb': MultinomialNB(),
    'gnb': GaussianNB(),
    'svm1': SVC(kernel='linear'),
    'svm2': SVC(kernel='rbf'),
    'svm3': SVC(kernel='sigmoid'),
    'mlp1': MLPClassifier(),
    'mlp2': MLPClassifier(hidden_layer_sizes=[100,100]),
    'ada': AdaBoostClassifier(),
    'dtc': DecisionTreeClassifier(),
    'rfc': RandomForestClassifier(),
    'gbc': GradientBoostingClassifier(),
    'lr': LogisticRegression()
}'''

In [None]:
'''acc_scores = dict()
for classifier in classifiers:
    pipeline = Pipeline([
    ('classifier',classifiers[classifier]),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    acc_scores[classifier] = accuracy_score(y_test, y_pred)
    print(classifier, acc_scores[classifier])'''
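
A single train/test split can be noisy, so one option is to compare candidates with cross-validation instead. The following is only a sketch on a tiny made-up corpus with three of the classifiers; it swaps the single split used above for `cross_val_score`, and the texts, labels and `candidates` dict are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up toy corpus: six posts per flair.
texts = [
    "cricket match score", "team wins cricket final", "football league match",
    "hockey team score", "player scores goal", "match ends in draw",
    "election results announced", "minister gives speech", "parliament passes bill",
    "party wins election", "government announces policy", "opposition protests bill",
]
labels = ["Sports"] * 6 + ["Politics"] * 6

candidates = {
    "mnb": MultinomialNB(),
    "dtc": DecisionTreeClassifier(random_state=0),
    "lr": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, clf in candidates.items():
    # Vectoriser inside the pipeline, so each fold fits TF-IDF on its own training part.
    model = Pipeline([("tfidf", TfidfVectorizer()), ("classifier", clf)])
    # Mean accuracy over 3 stratified folds.
    cv_scores[name] = cross_val_score(model, texts, labels, cv=3).mean()
    print(name, round(cv_scores[name], 3))
```

On the real dataset this is slower than a single split by roughly the number of folds, so it makes most sense for the faster classifiers.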