# About this Notebook

In this kernel, I will briefly explain the structure of the dataset, then generate and analyze meta-features. Next, I will visualize the dataset using Matplotlib, seaborn, and Plotly to gain as much insight as I can. Finally, I will treat this as an NLP classification problem and build a model.

If you are just starting with NLP, here is a guide on how to approach almost any NLP problem by Grandmaster [**@Abhishek Thakur**](https://www.slideshare.net/abhishekkrthakur/approaching-almost-any-nlp-problem).

**<span style="color:Red">If you find this kernel useful, please upvote it. It motivates me to write more quality content.</span>**

In [None]:
import gc
import os
import random
import re
import string
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from fastai.text import *
from fastai.callbacks import *
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import nltk
from nltk.corpus import stopwords

from tqdm import tqdm

warnings.filterwarnings("ignore")

* Below is a helper function that generates random colors.

In [None]:
def random_colours(number_of_colors):
    '''
    Simple function for random colours generation.
    Input:
        number_of_colors - integer value indicating the number of colours which are going to be generated.
    Output:
        Color in the following format: ['#E86DA4'] .
    '''
    colors = []
    for i in range(number_of_colors):
        colors.append("#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]))
    return colors

In [None]:
train = pd.read_csv('../input/60k-stack-overflow-questions-with-quality-rate/data.csv')

In [None]:
train.info()

* We have 60k rows and 6 columns.

# EDA

In [None]:
train.head()

* Class distribution in a random 50% sample of the data

In [None]:
temp = train.sample(frac=0.5).groupby('Y').count()['Body'].reset_index().sort_values(by='Body',ascending=False)
temp.style.background_gradient(cmap='Purples')
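
As a cross-check on the sampled counts, class shares can also be computed directly. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset); only the column name `Y` and the three labels come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for `train`; only the column name `Y` comes from the notebook.
toy = pd.DataFrame({"Y": ["HQ", "LQ_EDIT", "LQ_CLOSE", "HQ", "LQ_EDIT", "LQ_CLOSE"]})

# normalize=True returns each class's share instead of its raw count.
shares = toy["Y"].value_counts(normalize=True)
print(shares)
```

On the real data, replacing `toy` with `train` should give the share of each quality class without sampling.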

In [None]:
fig = go.Figure(go.Funnelarea(
    text =temp.Y,
    values = temp.Body,
    title = {"position": "top center", "text": "Funnel-Chart of Question Quality Distribution"}
    ))
fig.show()

## Generating Meta Features

* Difference in number of words between title and body
* Jaccard similarity score between title and body

For those who don't know what Jaccard similarity is: https://www.geeksforgeeks.org/find-the-jaccard-index-and-jaccard-distance-between-the-two-given-sets/
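
In short, the Jaccard similarity of two word sets is the size of their intersection divided by the size of their union. Here is a minimal worked example on two made-up strings; the function name `jaccard_similarity` and the sample texts are illustrative assumptions, not part of the dataset.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the lowercase, whitespace-split word sets."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    union = a | b
    # Two empty texts have an empty union; return 0.0 instead of dividing by zero.
    if not union:
        return 0.0
    return len(a & b) / len(union)


title = "How to sort a list in Python"
body = "I want to sort a list of numbers in Python"
score = jaccard_similarity(title, body)
print(round(score, 3))  # 6 shared words out of 11 distinct words
```

A score near 1 means the title and body use nearly the same vocabulary; a score near 0 means they share almost no words.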


In [None]:
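def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # Guard against two empty strings, which would otherwise divide by zero
    return float(len(c)) / union_size if union_size else 0.0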

In [None]:
results_jaccard=[]

for ind,row in train.iterrows():
    sentence1 = row.Title
    sentence2 = row.Body

    jaccard_score = jaccard(sentence1,sentence2)
    results_jaccard.append([sentence1,sentence2,jaccard_score])

In [None]:
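# Attach the scores by position; iterrows() preserved the row order.
# This avoids shadowing the `jaccard` function with a DataFrame of the same name,
# and avoids a merge that could duplicate rows sharing the same Title and Body.
train['jaccard_score'] = [score for _, _, score in results_jaccard]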

In [None]:
train['Num_words_body'] = train['Body'].apply(lambda x:len(str(x).split())) # Number of words in the body
train['Num_words_title'] = train['Title'].apply(lambda x:len(str(x).split())) # Number of words in the title
train['difference_in_words'] = abs(train['Num_words_body'] - train['Num_words_title']) # Absolute difference in number of words between body and title

In [None]:
train.head()

* Let's look at the distribution of Meta-Features

In [None]:
plt.figure(figsize=(12,6))
p1=sns.kdeplot(train['Num_words_body'], shade=True, color="r").set_title('Kernel Distribution of Number Of words')
p1=sns.kdeplot(train['Num_words_title'], shade=True, color="b")
plt.xlim(0,300)

* **As expected, the question body has more words than the title.**

**Now it will be more interesting to see the difference in number of words and the Jaccard scores across the different segments.**

In [None]:
train.Y.unique()

In [None]:
plt.figure(figsize=(12,6))
p1=sns.kdeplot(train[train['Y']=='HQ']['difference_in_words'], shade=True, color="b").set_title('Kernel Distribution of Difference in Number Of words')
p2=sns.kdeplot(train[train['Y']=='LQ_CLOSE']['difference_in_words'], shade=True, color="r")
p2=sns.kdeplot(train[train['Y']=='LQ_EDIT']['difference_in_words'], shade=True, color="g")
plt.legend(labels=['HQ','LQ_CLOSE','LQ_EDIT'])
plt.xlim(-20,500)

In [None]:
plt.figure(figsize=(12,6))
p1=sns.kdeplot(train[train['Y']=='HQ']['jaccard_score'], shade=True,).set_title('KDE of Jaccard Scores across different Quality Question')
p2=sns.kdeplot(train[train['Y']=='LQ_CLOSE']['jaccard_score'], shade=True, )
p3=sns.kdeplot(train[train['Y']=='LQ_EDIT']['jaccard_score'], shade=True, )
plt.legend(labels=['HQ','LQ_CLOSE','LQ_EDIT'])
plt.xlim(-0.05,0.4)

## Conclusion of EDA
* The target is almost evenly distributed across the 3 categories.
* `LQ_EDIT` questions have a smaller difference in number of words between **Body** and **Title**.
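
One way to put a number on the second observation is a per-class median of `difference_in_words`. The values below are made up for illustration (`toy` is hypothetical); only the column names mirror this notebook.

```python
import pandas as pd

# Hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Y": ["HQ", "HQ", "LQ_CLOSE", "LQ_CLOSE", "LQ_EDIT", "LQ_EDIT"],
    "difference_in_words": [120, 80, 60, 40, 30, 10],
})

# The median is less sensitive than the mean to a few very long questions.
median_diff = toy.groupby("Y")["difference_in_words"].median()
print(median_diff)
```

Running the same `groupby` on `train` would show whether the gap seen in the KDE plot also holds in the summary statistics.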


## Cleaning the Corpus

Before we dive into extracting information from the words in the title and body, let's first clean the data.

In [None]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove punctuation, remove newlines and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text
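
To see what each cleaning step does, here is a step-by-step trace on a made-up sentence. The `sample` string and the `step*` names are illustrative assumptions; the regular expressions are the same patterns used in `clean_text`.

```python
import re
import string

sample = "Why does <b>list[0]</b> fail? See https://example.com/q1 for Python3 details!"

step1 = sample.lower()
step2 = re.sub(r'\[.*?\]', '', step1)                               # drop [bracketed] text
step3 = re.sub(r'https?://\S+|www\.\S+', '', step2)                 # drop links
step4 = re.sub(r'<.*?>+', '', step3)                                # drop HTML tags
step5 = re.sub('[%s]' % re.escape(string.punctuation), '', step4)   # drop punctuation
step6 = re.sub(r'\w*\d\w*', '', step5)                              # drop words containing digits

# Removed tokens leave extra spaces behind; split() collapses them.
print(step6.split())
```

Note that the link and `python3` disappear entirely, so any signal carried by URLs or version numbers is lost after cleaning.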

In [None]:
train['Title'] = train['Title'].apply(lambda x:clean_text(x))
train['Body'] = train['Body'].apply(lambda x:clean_text(x))

In [None]:
train.head()

## Most Common words in our Body

In [None]:
train['temp_list'] = train['Body'].apply(lambda x:str(x).split())
top = Counter([item for sublist in train['temp_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

In [None]:
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in Body', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

**OOPS!** When we cleaned the dataset we didn't remove the stopwords, so one of the most common words is 'to'. 
We will remove stopwords when we look at the titles below; first, here is the same data as a treemap.

In [None]:
fig = px.treemap(temp, path=['Common_words'], values='count',title='Tree of Most Common Words in the body')
fig.show()

## Most Common Words in Title

Let's also look at the most common words in Title

In [None]:
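# Build the stopword set once; calling stopwords.words('english') for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stopword(x):
    return [y for y in x if y not in STOP_WORDS]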

In [None]:
train['temp_list1'] = train['Title'].apply(lambda x:str(x).split()) # List of words in every title
train['temp_list1'] = train['temp_list1'].apply(lambda x:remove_stopword(x)) # Remove stopwords
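
The effect of stopword removal on word counts can be seen on a tiny example. The stopword set `STOP` and the two titles below are made up for illustration; the notebook itself uses NLTK's English stopword list.

```python
from collections import Counter

# Hypothetical mini stopword set; the notebook uses NLTK's English list instead.
STOP = {"to", "a", "in", "the", "is", "how"}

titles = ["how to sort a list in python", "python error in the loop"]

# Tokenize on whitespace and keep only non-stopwords.
tokens = [w for t in titles for w in t.split() if w not in STOP]
counts = Counter(tokens)
print(counts.most_common(3))
```

Without the filter, `in` would tie with `python` as the most frequent token, which is the same problem seen with 'to' in the body counts.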

In [None]:
top = Counter([item for sublist in train['temp_list1'] for item in sublist])
temp = pd.DataFrame(top.most_common(25))
temp = temp.iloc[1:,:] # Drop the first row, i.e. the single most frequent token
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

In [None]:
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in Title', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

* The top 3 words in question titles are **C, python, and error**.

## Most Common Words by Question Quality

Let's look at the most common words in different question qualities

In [None]:
hq = train[train['Y']=='HQ']
lq_edit = train[train['Y']=='LQ_EDIT']
lq_close = train[train['Y']=='LQ_CLOSE']

In [None]:
# Most common HQ words
top = Counter([item for sublist in hq['temp_list'] for item in sublist])
temp_p = pd.DataFrame(top.most_common(20))
temp_p.columns = ['Common_words','count']
temp_p.style.background_gradient(cmap='Greens')

In [None]:
fig = px.bar(temp_p, x="count", y="Common_words", title='Most Common HQ words', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

In [None]:
# Most common LQ_EDIT words
top = Counter([item for sublist in lq_edit['temp_list'] for item in sublist])
temp_n = pd.DataFrame(top.most_common(20))
temp_n = temp_n.iloc[1:,:]
temp_n.columns = ['Common_words','count']
temp_n.style.background_gradient(cmap='Reds')

In [None]:
fig = px.treemap(temp_n, path=['Common_words'], values='count',title='Tree Of Most Common LQ_EDIT Words')
fig.show()

In [None]:
# Most common LQ_CLOSE words
top = Counter([item for sublist in lq_close['temp_list'] for item in sublist])
temp_n = pd.DataFrame(top.most_common(20))
temp_n = temp_n.iloc[1:,:]
temp_n.columns = ['Common_words','count']
temp_n.style.background_gradient(cmap='Reds')

In [None]:
fig = px.bar(temp_n, x="count", y="Common_words", title='Most Common LQ_CLOSE words', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

In [None]:
fig = px.treemap(temp_n, path=['Common_words'], values='count',title='Tree Of Most Common LQ_CLOSE Words')
fig.show()

* Words like **i, to, a, and, the, is** are common in all three segments, since stopwords were not removed from the body text.

## Let's Look at Unique Words in each Segment

We will look at unique words in each segment in the Following Order:
* HQ
* LQ_EDIT
* LQ_CLOSE

In [None]:
raw_text = [word for word_list in train['temp_list1'] for word in word_list]

In [None]:
def words_unique(segment,numwords,raw_words):
    '''
    Input:
        segment - segment category (e.g. 'HQ', 'LQ_EDIT');
        numwords - how many words to return in the final result;
        raw_words - flat list of all title words across every segment.
    Output: 
        dataframe of the words that occur only in the chosen segment, with how many times
        each occurs in that segment (in descending order of count).
    '''
    # Words that occur in any segment other than the chosen one
    # (a set makes the membership tests below fast)
    allother = set()
    for item in train[train.Y != segment]['temp_list1']:
        allother.update(item)
    
    # Words that never occur outside the chosen segment
    specificnonly = set(x for x in raw_words if x not in allother)
    
    mycounter = Counter()
    
    for item in train[train.Y == segment]['temp_list1']:
        for word in item:
            mycounter[word] += 1
    
    # Keep only the counts of segment-specific words
    for word in list(mycounter):
        if word not in specificnonly:
            del mycounter[word]
    
    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns = ['words','count'])
    
    return Unique_words
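
The idea behind `words_unique` is a set difference: keep the words of one segment that never appear in any other segment, then count them. Below is a compact sketch on made-up token lists; the `segments` dictionary and the `unique_words` helper are hypothetical and only the segment labels come from this notebook.

```python
from collections import Counter

# Hypothetical token lists per segment; labels match the notebook's `Y` values.
segments = {
    "HQ": [["python", "decorator"], ["python", "generics"]],
    "LQ_EDIT": [["python", "error"], ["error", "help"]],
    "LQ_CLOSE": [["python", "homework"], ["help", "urgent"]],
}


def unique_words(target, top_n):
    # Every word used in any segment other than the target.
    others = {w for seg, docs in segments.items() if seg != target
              for doc in docs for w in doc}
    # Count target-segment words, keeping only those absent elsewhere.
    counts = Counter(w for doc in segments[target] for w in doc if w not in others)
    return counts.most_common(top_n)


print(unique_words("LQ_EDIT", 5))
```

Here `python` and `help` are dropped for `LQ_EDIT` because they also occur in other segments, leaving only `error`.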

## HQ Questions

In [None]:
Unique_P= words_unique('HQ', 10, raw_text)
print("The top 10 unique words in HQ are:")
Unique_P.style.background_gradient(cmap='Greens')

In [None]:
from palettable.colorbrewer.qualitative import Pastel1_7
plt.figure(figsize=(16,10))
my_circle=plt.Circle((0,0), 0.7, color='white')
plt.pie(Unique_P['count'], labels=Unique_P.words, colors=Pastel1_7.hex_colors)
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('DoNut Plot Of Unique HQ Words')
plt.show()

## LQ_EDIT Questions

In [None]:
Unique_lqedit = words_unique('LQ_EDIT', 10, raw_text)
print("The top 10 unique words in LQ_EDIT are:")
Unique_lqedit.style.background_gradient(cmap='Reds')

In [None]:
from palettable.colorbrewer.qualitative import Pastel1_7
plt.figure(figsize=(16,10))
my_circle=plt.Circle((0,0), 0.7, color='white')
plt.rcParams['text.color'] = 'black'
plt.pie(Unique_lqedit['count'], labels=Unique_lqedit.words, colors=Pastel1_7.hex_colors)
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('DoNut Plot Of Unique LQ_EDIT Words')
plt.show()

## LQ_CLOSE Questions

In [None]:
Unique_N= words_unique('LQ_CLOSE', 10, raw_text)
print("The top 10 unique words in LQ_CLOSE are:")
Unique_N.style.background_gradient(cmap='Oranges')

In [None]:
from palettable.colorbrewer.qualitative import Pastel1_7
plt.figure(figsize=(16,10))
my_circle=plt.Circle((0,0), 0.7, color='white')
plt.pie(Unique_N['count'], labels=Unique_N.words, colors=Pastel1_7.hex_colors)
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('DoNut Plot Of Unique LQ_CLOSE Words')
plt.show()

**Looking at the unique words of each segment gives us much more clarity about the data; these words appear to be strong indicators of a question's segment.**

## It's Time For WordClouds

We will be building wordclouds in the following order:

* WordCloud of HQ Questions
* WordCloud of LQ_EDIT Questions
* WordCloud of LQ_CLOSE Questions

In [None]:
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(20.0,8.0), color = 'white',
                   title = None, title_size=40, image_color=False):
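    # Use a distinct local name so the NLTK `stopwords` import is not shadowed
    wc_stopwords = set(STOPWORDS)
    more_stopwords = {'u', "im"}
    wc_stopwords = wc_stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color=color,
                    stopwords = wc_stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
    # str() on a pandas Series gives a truncated repr, so join the texts explicitly
    if not isinstance(text, str):
        text = ' '.join(str(t) for t in text)
    wordcloud.generate(text)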
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  
d = '/kaggle/input/masks-for-wordclouds/'

In [None]:
plot_wordcloud(hq.Body,color='white',max_font_size=100,title_size=30,title="WordCloud of HQ Questions")

In [None]:
plot_wordcloud(lq_edit.Body,color='white',max_font_size=100,title_size=30,title="WordCloud of LQ_EDIT Questions")

In [None]:
plot_wordcloud(lq_close.Body,color='white',max_font_size=100,title_size=30,title="WordCloud of LQ_CLOSE Questions")

# Modeling the Problem as NLP Text Classification Task



**Text classification is the process of assigning tags or categories to text according to its content. 
It's one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.**

* We will use an AWD_LSTM architecture.
* First, we will build a language model that better understands the language of the questions.
* Then, using the language model's encoder, we will build a classifier.

In [None]:
df = train[['Title','Body','Y']].copy()
df.head()

In [None]:
path = Path('/kaggle/input/')

## Language Model for questions

In [None]:
data_lm = (TextList.from_df(df, path, cols=['Title','Body'] ) # Create A text list for model
                   .split_by_rand_pct(0.2)  # how to split data, 80% train, 20% validation
                   .label_for_lm() # label according to a language model
                   .databunch(bs=64)) # create a databunch

In [None]:
data_lm.save('/kaggle/working/data_lm.pkl')

In [None]:
data_lm.show_batch(rows=5)

In [None]:
learn = language_model_learner(data_lm,AWD_LSTM,drop_mult=0.4,
                               metrics=[accuracy,Perplexity()],model_dir='/kaggle/working/').to_fp16()
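
Perplexity is conventionally defined as the exponential of the average per-token cross-entropy, so lower values mean the language model is less "surprised" by the text. The sketch below works through that arithmetic with made-up probabilities; `token_probs` is hypothetical and not taken from the model above.

```python
import math

# Hypothetical probabilities a language model assigned to the true next token.
token_probs = [0.25, 0.5, 0.125, 0.5]

# Average negative log-likelihood (cross-entropy in nats).
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp(cross-entropy); lower is better.
perplexity = math.exp(cross_entropy)
print(round(perplexity, 3))
```

A perplexity of about 3.36 here means the model is, on average, as uncertain as if it were choosing uniformly among roughly 3.4 tokens.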

In [None]:
callbacks = SaveModelCallback(learn, monitor="perplexity", mode="min", name="best_lang_model")

In [None]:
lr = 5e-02
moms = (0.8, 0.7)
wd = 0.1

### Training the language model

In [None]:
learn.fit_one_cycle(8, slice(lr), moms=moms, wd=wd, callbacks=[callbacks])

In [None]:
learn.load('best_lang_model');

In [None]:
txt = 'the question is very simple'
[learn.predict(txt,n_words=30,temperature=0.5) for i in range(5)]

In [None]:
learn.save_encoder('ftenc')

In [None]:
learn = None
gc.collect()

## Now Time to Build a Classifier

In [None]:
data_cls = (TextList.from_df(df, path, cols=['Title','Body'], vocab=data_lm.vocab)
            # Text list for the classifier: df is the dataframe, cols are the text columns to use,
            # and vocab reuses the language model's vocabulary
                    .split_by_rand_pct(0.2,seed=64)
            # hold out a random 20% of the rows as the validation set
                    .label_from_df(cols='Y')
            # take the labels from the target column 'Y'
                    .databunch(bs=128))
            # create a databunch

In [None]:
data_cls.show_batch(rows=5)

In [None]:
clf = text_classifier_learner(data_cls, AWD_LSTM, metrics=[accuracy], drop_mult=0.3,model_dir='/kaggle/working/').to_fp16()
clf.load_encoder('/kaggle/working/ftenc');

## Classifier Model Summary

In [None]:
clf.summary()

In [None]:
gc.collect()

In [None]:
cb = SaveModelCallback(clf, monitor="accuracy", mode="max", name="best_clf")

In [None]:
clf.unfreeze()
clf.fit_one_cycle(8, 1e-2 ,moms=(0.8,0.7), callbacks=[cb])

## Classifier Interpretation

In [None]:
clf.load('best_clf');

In [None]:
interp = TextClassificationInterpretation.from_learner(clf)

In [None]:
interp.show_intrinsic_attention("why are java optionals immutable")

In [None]:
interp.show_intrinsic_attention("why ternary operator in swift is so picky")

In [None]:
interp.plot_confusion_matrix(figsize=(5,5))
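
For readers new to confusion matrices, here is how one is tallied by hand. The labels in `y_true` and `y_pred` are made up for illustration and are not outputs of the classifier above; in this sketch rows are actual classes and columns are predicted classes.

```python
from collections import Counter

labels = ["HQ", "LQ_CLOSE", "LQ_EDIT"]

# Hypothetical true and predicted labels, not outputs of the model above.
y_true = ["HQ", "HQ", "LQ_CLOSE", "LQ_CLOSE", "LQ_EDIT", "LQ_EDIT"]
y_pred = ["HQ", "LQ_CLOSE", "LQ_CLOSE", "LQ_CLOSE", "LQ_EDIT", "HQ"]

# Count each (actual, predicted) pair; missing pairs default to 0.
pair_counts = Counter(zip(y_true, y_pred))

# Rows are actual classes, columns are predicted classes.
matrix = [[pair_counts[(actual, predicted)] for predicted in labels] for actual in labels]
for actual, row in zip(labels, matrix):
    print(f"{actual:>8}: {row}")

# Accuracy is the sum of the diagonal divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
```

Off-diagonal cells show which classes get confused with each other, which is often more informative than accuracy alone.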

## Let's see our top losses

In [None]:
interp.show_top_losses(10)

<h2> <span style="color:Red">I hope you liked my kernel. An upvote is a gesture of appreciation and encouragement to keep improving my efforts; please be kind enough to leave one.</span></h2>