## Text Modeling techniques:
In this notebook, I use spaCy, NLTK, and other libraries to explore basic NLP modeling techniques. We will train a handful of small text classification models. This is a good resource notebook for those who are starting out with NLP and want to explore simple techniques and methodologies.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import spacy
import nltk
import re
from nltk.corpus import stopwords

In [None]:
text_data = pd.read_csv('/kaggle/input/dataisbeautiful/r_dataisbeautiful_posts.csv')
print("the data shape is:",text_data.shape)
text_data.head(5)

In [None]:
cols = list(text_data.columns)

In [None]:
print(cols)

## Basic information:
The dataset has 193,091 rows (about 193k), which makes it a moderately large dataset.<br/>
The reddit data has the following columns:<br/>
(1) **id**: a unique id for each post.<br/>
(2) **title**: the title text of the reddit post.<br/>
(3) **score**: each reddit post can be upvoted or downvoted and thereby receives a score; this column holds that score.<br/>
(4) **author**: the user name of the poster.<br/>
(5) **author_flair_text**: it is not yet clear exactly what this represents. We will inspect the data first and check other notebooks too.<br/>
(6) **removed_by**: if the post was eventually removed, who removed it. This is a very interesting source of data.<br/>
(7) **created_utc**: when the post was created, in UTC, stored in [unix epoch](https://www.utctime.net/) format. We need to convert it to a regular datetime to work with it.<br/>
(8) **full_link**: the full URL of the reddit post. It contains reddit's domain, the subreddit, and other information, so we need to parse the link to extract the subreddit and other fields.<br/>
(9) **num_comments**: the total number of comments on the post.<br/>
(10) **over_18**: the NSFW tag on reddit; it denotes whether the post contains adult content.<br/>
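
As a small illustration of the link parsing mentioned for **full_link**, here is a minimal sketch that pulls the subreddit name out of a post URL with the standard library. The helper name `subreddit_from_link` and the example URL are hypothetical, and the sketch assumes links follow the usual `/r/<subreddit>/comments/...` layout.

```python
from urllib.parse import urlparse

def subreddit_from_link(link):
    """Return the subreddit name from a reddit post URL, or '' if it is absent."""
    # Split the URL path into its non-empty segments.
    parts = [p for p in urlparse(link).path.split("/") if p]
    # Assumed layout: /r/<subreddit>/comments/<post_id>/<slug>/
    if len(parts) >= 2 and parts[0] == "r":
        return parts[1]
    return ""

# Hypothetical URL, used only for illustration.
example = "https://www.reddit.com/r/dataisbeautiful/comments/abc123/some_title/"
print(subreddit_from_link(example))
```

Parsing the path instead of stripping a fixed prefix also copes with links that use a different host prefix, such as `old.reddit.com`.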

## First action: Build an NSFW classifier
In this section we are going to create different features and build an NSFW classifier.

In [None]:
re.sub(r'\[oc\]','','[oc]granny')  # raw string avoids invalid-escape warnings

In [None]:
text = text_data.drop(['id','author','author_flair_text',
                       'created_utc','awarders'],axis = 1)
text['over_18'] = text['over_18']*1
def replacer(x):
    return re.sub(r'\[OC\]','',x)  # raw string so \[ and \] reach the regex engine unchanged
def replacer_fulllink(x):
    path_reduced = re.sub('https://www.reddit.com/r/','',x)
    path_reduced_list = path_reduced.split("/")
    return path_reduced_list[0]
text['title'] = text['title'].apply(lambda x: replacer(str(x)))
text['removed_by'] = text['removed_by'].fillna("")
text['subreddit'] = text['full_link'].apply(lambda x: replacer_fulllink(x))

In [None]:
text['subreddit'].value_counts()

In [None]:
text.head()

In [None]:
text['over_18'].value_counts()

In [None]:
import seaborn as sns
from scipy.stats import norm
sns.distplot(text['score'].tolist(),fit = norm, kde = False)

In [None]:
text['log_score'] = text['score'].apply(lambda x: np.log(x+1)/np.log(10))

In [None]:
sns.distplot(text['log_score'].tolist(),fit = norm, kde = False)

In [None]:
print("rows with log_score <= 3:",text[text['log_score']<=3].shape)
print("rows with log_score > 3:",text[text['log_score']>3].shape)

The data is extremely imbalanced, so we will have to keep that in mind when modeling.
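
One common heuristic for handling imbalance in a classifier is to weight each class inversely to its frequency. The sketch below shows that idea on toy labels; the helper name `inverse_frequency_weights` and the toy data are assumptions for illustration, and this is not necessarily how the weights used later in this notebook were chosen.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by (majority class count / class count)."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {label: majority / n for label, n in counts.items()}

# Toy labels: 95 negatives and 5 positives (not the real dataset).
toy_labels = [0] * 95 + [1] * 5
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

A dictionary of this shape can be passed as `class_weight` to scikit-learn classifiers that accept it, such as `RandomForestClassifier`.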

In [None]:
text['removed_by'].value_counts()

In [None]:
for elem in ['moderator','deleted','automod_filtered',
             'reddit','author']:
    text['removed_by_'+elem] = text['removed_by'].apply(lambda x: (x==elem)*1.0)
text['Not_removed'] = text['removed_by'].apply(lambda x: (x=='')*1.0)

In [None]:
text = text.drop(['subreddit','full_link','removed_by'],axis = 1)

In [None]:
text.columns

## Bag of word creation:
For NSFW posts, let's build a bag of words from the words that are among the most frequent in NSFW titles but not among the most frequent in non-NSFW titles, and the same in the other direction. Then we will create features from these words.
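
To make the idea concrete, here is a minimal sketch on toy data: take the top words of each group, then keep the ones that appear in only one group's top list. The helper `top_words` and the toy titles are hypothetical; the notebook itself works on the real titles with NLTK.

```python
from collections import Counter

def top_words(docs, n):
    """Return the set of the n most frequent whitespace-separated words."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(n)}

# Toy stand-ins for the two classes of titles.
group_a = ["red apple pie", "red apple tart", "red pie"]
group_b = ["green apple pie", "green pear pie", "green pear"]

top_a = top_words(group_a, 3)
top_b = top_words(group_b, 3)

# Words frequent in one group but not in the other.
exclusive_a = top_a - top_b
exclusive_b = top_b - top_a
print(sorted(exclusive_a), sorted(exclusive_b))
```

Note that "exclusive" here only means absent from the other group's top-n list; the word may still occur in the other group less often.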

In [None]:
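# Build the stopword set once instead of on every call.
forbidden_words = set(stopwords.words('english'))

def text_cleaning(text):
    """Lowercase a title, strip URLs and non-letters, and return the non-stopword tokens."""
    if text:
        # Remove URLs first; splitting on '.' or '/' beforehand would break them apart.
        text = re.sub(r'http\S+', '', text)
        text = ' '.join(text.split('.'))                 # treat periods as word breaks
        text = re.sub(r'/', ' ', text)
        text = re.sub(r'\\', ' ', text)
        text = re.sub(r'[^A-Za-z]', ' ', text.lower())   # keep letters only
        text = re.sub(r'\s+', ' ', text).strip()         # collapse whitespace
        return [word for word in text.split() if word not in forbidden_words]
    return []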

In [None]:
re.sub(r'\\',' ',r'aof\god')  # raw input string so the backslash is literal

In [None]:
text['title'] = text['title'].apply(lambda x: ' '.join(text_cleaning(x)))

In [None]:
# Join titles with a space so words at title boundaries do not fuse together.
nsfw_text = ' '.join(text['title'][text['over_18']==1].tolist())
sfw_text = ' '.join(text['title'][text['over_18']==0].tolist())

In [None]:
def return_top_words(text,words = 10):
    allWords = nltk.tokenize.word_tokenize(text)
    # use a set for fast lookup and avoid shadowing the imported `stopwords`
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # lowercase before the stopword check so capitalized stopwords are filtered too
    allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stop_words)
    mostCommontuples= allWordExceptStopDist.most_common(words)
    mostCommon = [tupl[0] for tupl in mostCommontuples]
    return mostCommon

In [None]:
# Take the 400 most frequent words from each class, then keep those unique to one class.
top_nsfw_words = return_top_words(nsfw_text,400)
top_sfw_words = return_top_words(sfw_text,400)
top_nsfw_exclusive = list(set(top_nsfw_words).difference(set(top_sfw_words)))
top_sfw_exclusive = list(set(top_sfw_words).difference(set(top_nsfw_words)))

In [None]:
total_vocab = top_nsfw_exclusive + top_sfw_exclusive
for word in total_vocab:
    # test whole-word membership; a plain `word in x` would also match substrings
    text['Is_'+word+'_in_title'] = text['title'].apply(lambda x: (word in x.split())*1.0)

In [None]:
text = text.drop('title',axis = 1)

In [None]:
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import classification_report
Y = text['over_18']
X = text.drop('over_18',axis = 1)
X_train,X_val,Y_train,Y_val = tts(X,Y,test_size = 0.2,stratify = Y,random_state = 42)

In [None]:
X_train = X_train.fillna(0)
X_val = X_val.fillna(0)

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC
forest = RFC(n_estimators = 128,max_depth = 22,class_weight = {0:1,1:192},
            n_jobs = -1,random_state = 42)
forest.fit(X_train,Y_train)
pred_train = forest.predict(X_train)
pred_val = forest.predict(X_val)
print(classification_report(Y_train,pred_train))
print(classification_report(Y_val,pred_val))

So a random forest reaches an F1-score of around 33% for NSFW detection. We could also use an LSTM model for this, but since the problem does not seem to be a semantic one, I do not think an LSTM is a good fit here. We will now move on to the next problem.
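
For readers new to the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name and the counts are made up for illustration and are not this model's actual confusion matrix.

```python
def f1_from_counts(tp, fp, fn):
    """Compute F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up counts: precision 0.25 and recall 0.5.
toy_f1 = f1_from_counts(tp=10, fp=30, fn=10)
print(round(toy_f1, 4))
```

On a heavily imbalanced problem, a modest F1 for the rare class can coexist with very high overall accuracy, which is why the per-class report is more informative here.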

## Score prediction from this dataset
As we already saw, the score is very heavily right-skewed, with a long tail of high scores; it looks like a count variable, roughly Poisson with inflation at the low end. We will try to predict the scores, or at least zero vs. non-zero, from the title, the number of awards, and the other variables in the dataset.
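
A quick way to check the direction of skew is to compare the mean with the median: a long tail of large values pulls the mean well above the median. The sketch below does this on toy scores (invented for illustration, not taken from the dataset) and also shows the `log10(x + 1)` compression used earlier in the notebook.

```python
import math
from statistics import mean, median

# Toy scores with a long tail of large values.
toy_scores = [1, 1, 1, 1, 2, 3, 5, 40, 900]

score_mean = mean(toy_scores)
score_median = median(toy_scores)
print(score_mean, score_median)

# log10(x + 1) compresses the tail and keeps zero mapped to zero.
log_scores = [math.log10(s + 1) for s in toy_scores]
print(round(max(log_scores), 3))
```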

First things first, let's drop the useless columns. But wait, actually none of the features is clearly insignificant. An author's name can have some implicit effect on scores. Link length, number of comments, total_awards_received, the awarder list, whether the post was removed, and whether it is NSFW may all matter as well.<br/>
So we drop nothing other than id.

In [None]:
text_data = text_data.drop('id',axis = 1)

We need to convert created_utc to see whether time has any effect on the score. It is unix epoch time, so let's try decoding it.
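
As a side note, `datetime.utcfromtimestamp` is deprecated as of Python 3.12. A timezone-aware alternative is sketched below; the helper name `epoch_to_utc_string` is hypothetical, and it assumes the timestamps are in seconds.

```python
from datetime import datetime, timezone

def epoch_to_utc_string(ts):
    """Format a unix epoch timestamp (in seconds) as a UTC date-time string."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(epoch_to_utc_string("1284101485"))
```

For a whole column, `pd.to_datetime(series, unit='s')` performs the same conversion without a Python-level loop.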

In [None]:
from datetime import datetime
ts = int("1284101485")

# if you encounter a "year is out of range" error the timestamp
# may be in milliseconds, try `ts /= 1000` in that case
print(datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))


In [None]:
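def time_unix_change(x):
    x = int(x)
    # Convert the row's own timestamp x, not the global example `ts`.
    # The two spaces in the format are kept on purpose: the hour is later sliced by position.
    return datetime.utcfromtimestamp(x).strftime('%Y-%m-%d  %H:%M:%S')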

In [None]:
text_data['time'] = text_data['created_utc'].apply(lambda x: time_unix_change(x))

In [None]:
text_data.time.describe()

In [None]:
text_data.head()

In [None]:
text_data = text_data.drop(['author','full_link','created_utc'],axis = 1)
text_data['removed_by'] = text_data['removed_by'].fillna('')
def replacer(x):
    return re.sub(r'\[OC\]','',x)
text_data['title'] = text_data['title'].apply(lambda x: replacer(str(x)))
text_data['num_title'] = text_data['title'].apply(lambda x: len(x))
text_data['over_18'] = text_data['over_18']*1.0

In [None]:
# The hour sits at positions 12-13 of 'YYYY-MM-DD  HH:MM:SS' (note the two spaces).
text_data['hr'] = text_data['time'].apply(lambda x: x[12:14])

In [None]:
text_data['author_flair_text'] = text_data['author_flair_text'].fillna('')
print(text_data['author_flair_text'].unique())

In [None]:
def flair_cleaner(x):
    x = re.sub(r'\[OC\]','',x)
    x = re.sub('OC','',x)
    # Escape the pipe: a bare '|' matches the empty string and changes nothing.
    x = re.sub(r'\|',' ',x)
    x = re.sub('oc','',x)
    x = re.sub(r'\s+', ' ', re.sub('[^A-Za-z]', ' ', x.strip().lower())).strip()
    return x

In [None]:
text_data['author_flair_text'] = text_data['author_flair_text'].apply(lambda x: flair_cleaner(x))

In [None]:
text_data = text_data.drop(['awarders','time'],axis = 1)

In [None]:
text_data.head()

In [None]:
text_data = text_data[~text_data['score'].isna()]

In [None]:
text_data['removed_by'].unique()

In [None]:
text_data['Not_removed'] = text_data['removed_by'].apply(lambda x: (x == '')*1.0)
for elem in ['automod_filtered', 'moderator', 'reddit', 'deleted', 'author']:
    text_data['removed_by_'+elem] = text_data['removed_by'].apply(lambda x: (x == elem)*1.0)

In [None]:
text_data = text_data.drop('removed_by',axis = 1)

In [None]:
text_data.head()

In [None]:
text_data['author_flair_text'].unique()

Words like researcher, statistics, practitioner, nasa, institute, economics, and prof give a sense of trust because the subreddit is data related. So let's create a feature for that.

In [None]:
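def expertise(x):
    """Count how many trust-signalling keywords appear in the cleaned flair text."""
    count = 0
    # 'institute' is included to match the keyword list described above
    for el in ['researcher','statistics','practitioner','nasa','institute','economics','prof']:
        if el in x: count = count + 1
    return count
text_data['author_flair_text'] = text_data['author_flair_text'].apply(lambda x: expertise(x))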

In [None]:
text_data = text_data.drop('title',axis = 1)

Time is a cyclic feature, so let's transform it.
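
The reason for a sine/cosine encoding is that hour 23 and hour 0 are neighbours on the clock even though they are far apart as plain numbers. The sketch below illustrates this with a period of 24; the helper name `encode_hour` is hypothetical.

```python
import math

def encode_hour(hour, period=24):
    """Map an hour onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

# Distance on the circle between a few pairs of hours.
d_23_0 = math.dist(encode_hour(23), encode_hour(0))
d_0_1 = math.dist(encode_hour(0), encode_hour(1))
d_12_0 = math.dist(encode_hour(12), encode_hour(0))
print(round(d_23_0, 3), round(d_0_1, 3), round(d_12_0, 3))
```

With a period of 24, adjacent hours are equally spaced all the way around, and noon ends up as far from midnight as possible.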

In [None]:
# Use a period of 24 so that hours 0 and 23 map to adjacent, distinct points on the circle.
text_data['hour_sin'] = text_data['hr'].apply(lambda x : np.sin(2 * np.pi * float(x)/24.0))
text_data['hour_cos'] = text_data['hr'].apply(lambda x : np.cos(2 * np.pi * float(x)/24.0))

In [None]:
text_data = text_data.rename(columns = {'author_flair_text':'expertise_count_author'})

In [None]:
text_data = text_data.drop('hr',axis = 1)

In [None]:
text_data.head()

In [None]:
sns.distplot(text_data['score'])

In [None]:
text_data['score'].value_counts()

In [None]:
text_data[text_data['score']<100].shape

It looks like 1 is a very dominant score, since it is the first score assigned to most posts; we will therefore first train a model for 1 vs. not 1.

In [None]:
text_data['score_class'] = text_data['score'].apply(lambda x: (x == 1)*1.0)

In [None]:
scores = text_data['score'].tolist()
text_data = text_data.drop('score',axis = 1)

In [None]:
Y = text_data['score_class']
X = text_data.drop('score_class',axis = 1)
X_train,X_val,Y_train,Y_val = tts(X,Y,test_size = 0.2,stratify = Y,random_state = 42)

In [None]:
X_train = X_train.fillna(0)
X_val = X_val.fillna(0)

In [None]:
Y_train.isna().sum()

In [None]:
forest = RFC(n_estimators = 128,max_depth = 22,class_weight = {0:1.33,1:1},
            n_jobs = -1,random_state = 42)
forest.fit(X_train,Y_train)
pred_train = forest.predict(X_train)
pred_val = forest.predict(X_val)
print(classification_report(Y_train,pred_train))
print(classification_report(Y_val,pred_val))