<h1 style="text-align:center">GitHub Bug Prediction</h1>


In this notebook I carry out a detailed exploratory data analysis (EDA) for the GitHub Bug Prediction problem. It is the first notebook in a series that implements an end-to-end approach to this problem.

<h3> Problem statement : </h3>
<p> Given an issue posted on GitHub, predict whether it is a bug, a feature, or a question, based on the issue title and body text.</p> 

## Importing Libraries

In [None]:
# Basic Libraries

import numpy as np
import pandas as pd
import re
import string
import random
import math
import time
import json
import os
import itertools
import collections
from collections import Counter, defaultdict
import nltk
import spacy
import pickle
from tqdm import tqdm


# Visualization

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

from plotly import tools
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

from wordcloud import WordCloud,STOPWORDS

from sklearn.decomposition import PCA, TruncatedSVD, SparsePCA
from sklearn.manifold import TSNE


# Preprocessing

from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split, cross_val_score,  cross_val_predict
from sklearn.model_selection import StratifiedKFold, KFold, StratifiedShuffleSplit, GridSearchCV

from imblearn.over_sampling import ADASYN,SMOTE
from imblearn.under_sampling import NearMiss

from bs4 import BeautifulSoup

from nltk import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

nlp_spcy = spacy.load("en_core_web_sm", disable=["tagger", "parser","ner"])
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS = list(set(STOP_WORDS))

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from gensim.models import Word2Vec,KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec



import warnings
warnings.filterwarnings(action = "ignore")

### Data Description :


<p>  The dataset contains only three fields:
    <ul>
        <li>Title - the title of the GitHub bug/feature/question</li>
        <li>Body - the body of the GitHub bug/feature/question</li>
        <li>Label - the label we are trying to predict for the given GitHub issue. It takes one of the following classes:
            <ol>
                <li>Bug - 0</li>
                <li>Feature - 1</li>
                <li>Question - 2</li>
            </ol>
        </li>
    </ul>
</p>

## Load Train Dataset

In [None]:
train_df = pd.read_json('../input/github-bugs-prediction/embold_train.json')
train_df.head()

## Check the shape of the train data

In [None]:
print('Number of data points : ', train_df.shape[0])
print('Number of features : ', train_df.shape[1])
print('Features : ', train_df.columns.values)

## Load Extra Train Dataset

In [None]:
train_extra_df = pd.read_json('../input/github-bugs-prediction/embold_train_extra.json')
train_extra_df.head()

## Check the shape of the extra train data

In [None]:
train_extra_df.shape

### Inference:

- The extra train data contains twice as many data points as the original train data.
- If the model overfits when trained on the original train data, this additional train data should help.
- We can use this extra train data when required for further analysis.

## Check the basic stats of the data

In [None]:
train_df.info()

In [None]:
# check the basic stats
train_df.describe(include='all')

In [None]:
# check the data for null values
train_df.isnull().sum()

### Inferences:
- The 'title' feature contains duplicates: **323 (150000 - 149677)** values are repeats of an earlier title.
- **'add unit tests'** is the most frequent 'title'; it occurs 15 times in the train data.
- There are no duplicate entries for the 'body' feature.
- Since the 'body' text is unique for each issue, it is likely to carry most of the signal for categorising an issue as a bug, feature or question. The 'title' is still useful because it briefly sets the context of the issue.
- For our analysis we will merge the title and the body into a single 'text' feature.
- No null values are present in the train data.
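The duplicate counts above are read off the `unique`, `top` and `freq` rows of `describe`. As a minimal sketch of the same bookkeeping, here is a standard-library version run on a made-up list of titles (the list is hypothetical and not taken from the dataset):

```python
from collections import Counter

# Hypothetical toy titles; the notebook would read these from train_df['title'].
titles = ["add unit tests", "fix crash on login", "add unit tests",
          "update readme", "add unit tests", "fix crash on login"]

title_counts = Counter(titles)

# Rows minus distinct values = number of repeated (non-first) occurrences.
n_repeated = len(titles) - len(title_counts)

# Most frequent title and its frequency (the 'top' and 'freq' rows of describe()).
top_title, top_freq = title_counts.most_common(1)[0]

print(n_repeated, top_title, top_freq)
```

On the DataFrame itself, `train_df['title'].duplicated().sum()` should give the same repeat count.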

In [None]:
#Checking the duplicate entries for 'title'

train_df.loc[train_df['title'] == 'add unit tests']

## Combining Title and Body into a Single Feature for further analysis

In [None]:
train_df['text'] = train_df.title + ' ' + train_df.body
train_df.head(10)

## Distribution of data points amongst output labels

In [None]:
label_counts = train_df.label.value_counts().sort_index()
label_counts

In [None]:
#Check the percentage of data points in each category

(train_df.label.value_counts(normalize=True).sort_index())*100

In [None]:
print('Number of datapoints with label as Bug :',label_counts[0])
print('Number of datapoints with label as Feature :',label_counts[1])
print('Number of datapoints with label as Question :',label_counts[2])

## Plot the distribution of data points amongst output labels

In [None]:
plt.figure(figsize=(8,6))
label_counts.plot(kind='bar', color=['r','g','b'])

B = mpatches.Patch(color='r', label='Bug')
F = mpatches.Patch(color='g', label='Feature')
Q = mpatches.Patch(color='b', label='Question')

plt.legend(handles=[B,F,Q], loc='best')

plt.xlabel('Type of Labels')
plt.ylabel('Count of Data per Label Category')
plt.title('Distribution of labels in train data')
plt.show()

### Inference:

- The labels are well balanced between the 'Bug' and 'Feature' categories, whereas 'Question' labels are comparatively few.
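One common way to quantify such an imbalance is to compute inverse-frequency class weights. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * count)` on a hypothetical label list. The counts are illustrative only, and whether to use weights, resampling or the extra train data is a modelling decision left to the later notebooks.

```python
from collections import Counter

# Hypothetical label list (0 = Bug, 1 = Feature, 2 = Question); not the real counts.
labels = [0] * 45 + [1] * 45 + [2] * 10

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so rarer classes receive proportionally larger weights.
class_weights = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

for c, w in class_weights.items():
    print(f"label {c}: share={counts[c] / n_samples:.2f}, weight={w:.3f}")
```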

## Filter the train_data based on each unique label category

In [None]:
Bug_data=train_df[train_df['label']== 0]
Feature_data=train_df[train_df['label']== 1]
Question_data=train_df[train_df['label']== 2]

In [None]:
print("First 10 rows of the 'text' feature with the Bug_data:\n", Bug_data['text'].head(10), "\n")
print("First 10 rows of the 'text' feature with the Feature_data:\n", Feature_data['text'].head(10), "\n")
print("First 10 rows of the 'text' feature with the Question_data:\n", Question_data['text'].head(10), "\n")

## Analyse the count of words in 'text' feature for each unique label category

In [None]:
count_text_Bug = Bug_data.text.str.split().apply(lambda w : len(w)).sort_values(ascending=True)
count_text_Feature = Feature_data.text.str.split().apply(lambda w : len(w)).sort_values(ascending=True)
count_text_Question = Question_data.text.str.split().apply(lambda w : len(w)).sort_values(ascending=True)

In [None]:
print("Count of words in text feature for Bug Data:\n",count_text_Bug,"\n")
print("Count of words in text feature for Feature Data:\n",count_text_Feature,"\n")
print("Count of words in text feature for Question Data:\n",count_text_Question,"\n")

In [None]:
print("The word count for text for Bug Data varies between a minimum of ", str(np.min(count_text_Bug.values)), "and maximum of ", str(np.max(count_text_Bug.values)) )
print("The word count for text for Feature Data varies between a minimum of ", str(np.min(count_text_Feature.values)), "and maximum of ", str(np.max(count_text_Feature.values)) )
print("The word count for text for Question Data varies between a minimum of ", str(np.min(count_text_Question.values)), "and maximum of ", str(np.max(count_text_Question.values)) ) 

## Plot the distribution of words in body text for each output labels

In [None]:
# Creating a generic distribution-plot function with Seaborn
# (histplot with a KDE overlay replaces the deprecated distplot)

def plot_count_dist(count_Bug,count_Feature,count_Question,title_1,title_2,title_3,subtitle):
    fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(18,6))
    sns.histplot(count_Bug,ax=ax1,color='r',kde=True,stat='density')
    ax1.set_title(title_1)
    sns.histplot(count_Feature,ax=ax2,color='g',kde=True,stat='density')
    ax2.set_title(title_2)
    sns.histplot(count_Question,ax=ax3,color='b',kde=True,stat='density')
    ax3.set_title(title_3)
    fig.suptitle(subtitle)
    plt.show()

In [None]:
plot_count_dist(count_text_Bug,count_text_Feature,count_text_Question,'Bug','Feature','Question','Text Data Word Count Analysis')

### Inference:

- The word count of the 'text' feature for Bug, Feature and Question data mostly lies between 5 and 300 words.
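A claim like this can be backed by a number instead of reading it off the plot. As a rough sketch, the share of texts whose word count falls inside a range can be computed as follows (the texts and the resulting share are made up for illustration):

```python
# Hypothetical toy texts; in the notebook these would be the values of train_df['text'].
texts = [
    "crash on startup",
    "add dark mode to the settings page please",
    "how do i configure the proxy for the cli tool",
    "bug",
    " ".join(["word"] * 350),  # an unusually long issue body
]

# Same whitespace tokenisation as Series.str.split() followed by len().
word_counts = [len(t.split()) for t in texts]

low, high = 5, 300
in_range = sum(low <= n <= high for n in word_counts)
share = in_range / len(word_counts)

print(word_counts, f"{share:.0%} of texts have {low}-{high} words")
```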

## Analyse the distribution of Punctuations in 'text' feature for each unique label category

In [None]:
#Check the standard punctuation characters

print("Standard Punctuation characters:",string.punctuation)

In [None]:
count_text_punctuations_Bug = Bug_data.text.apply(lambda w : len([p for p in str(w) if p in string.punctuation])).sort_values()
count_text_punctuations_Feature = Feature_data.text.apply(lambda w : len([p for p in str(w) if p in string.punctuation])).sort_values()
count_text_punctuations_Question = Question_data.text.apply(lambda w : len([p for p in str(w) if p in string.punctuation])).sort_values()

In [None]:
print("Count of punctuations in text feature for Bug Data:\n",count_text_punctuations_Bug,"\n")
print("Count of punctuations in text feature for Feature Data:\n",count_text_punctuations_Feature,"\n")
print("Count of punctuations in text feature for Question Data:\n",count_text_punctuations_Question,"\n")

In [None]:
plot_count_dist(count_text_punctuations_Bug,count_text_punctuations_Feature,count_text_punctuations_Question,'Bug','Feature','Question','Text Data Punctuations Count Analysis')

In [None]:
#sample bug data with most number of punctuations

print('Bug: \n Sample Bug data with most number of punctuations = ', count_text_punctuations_Bug.max(), '\n')
print(train_df.loc[count_text_punctuations_Bug.idxmax()]['text'])  # idxmax returns an index label, so use .loc


In [None]:
#sample feature data with most number of punctuations

print('Feature: \n Sample Feature data with most number of punctuations = ', count_text_punctuations_Feature.max(), '\n')
print(train_df.loc[count_text_punctuations_Feature.idxmax()]['text'])  # idxmax returns an index label, so use .loc

In [None]:
#sample Question data with most number of punctuations

print('Question: \n Sample Question data with most number of punctuations = ', count_text_punctuations_Question.max(), '\n')
print(train_df.loc[count_text_punctuations_Question.idxmax()]['text'])  # idxmax returns an index label, so use .loc

### Inference:

- Punctuation is heavily present across the data for every label.
- Noisy text such as **"\\r"** is also widespread in the 'text' feature across all label categories. This noisy data needs to be cleaned explicitly.
- This implies that punctuation characters must be cleaned properly in the text pre-processing phase.
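The `"\\r"` noise is easy to confuse with a real carriage return. The sketch below, on a hypothetical sample string, counts punctuation the same way as above and shows that the literal two-character sequence (backslash followed by `r`) and the control character `"\r"` are different things that need separate handling:

```python
import string

# Hypothetical sample: contains the two-character sequence backslash + 'r'
# and one real carriage-return character.
sample = "Steps:\\r1. run `make`\\r2. it fails!\r\nWhy?"

# Count characters that belong to string.punctuation (the backslash is one of them).
n_punct = sum(ch in string.punctuation for ch in sample)

# Literal "\\r" is two characters; "\r" is a single control character.
n_literal = sample.count("\\r")
n_control = sample.count("\r")

# Replace both forms with a space so adjacent words are not glued together.
cleaned = sample.replace("\\r", " ").replace("\r", " ")

print(n_punct, n_literal, n_control)
print(cleaned)
```

Replacing with a space rather than an empty string is a design choice of this sketch. It avoids merging the words on either side of the marker.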

## Plotting WordCloud for Unprocessed Data

In [None]:
#Creating a generic function for generating WordCloud across different label categories.

def generate_wordcloud(df,col,i,label):
    
    data = df[df.label == i][col].values
    
    wc = WordCloud(stopwords=STOPWORDS, background_color='black',
                   max_words=10000, min_font_size=6, min_word_length=1)
    wc.generate(' '.join(data))
    
    plt.figure(figsize=(15,15))
    plt.title('WordCloud for {}'.format(label), fontsize = 24)
    plt.imshow(wc)
    plt.axis("off")
    plt.show()

In [None]:
#%%time

labels = np.unique(train_df.label.values)
label_names = ['Bug','Feature','Question']
for i in tqdm(labels):
    generate_wordcloud(train_df,'text',i,label_names[i])

## Gram Statistic for Unprocessed Data

In [None]:
#Creating a generic function for plotting n-Grams across different label categories.

def gram_analysis(data,gram):
    stop_words_set = set(stopwords.words('english'))
    tokens=[t for t in data.lower().split(" ") if t!="" if t not in stop_words_set]
    ngrams=zip(*[tokens[i:] for i in range(gram)])
    final_tokens=[" ".join(z) for z in ngrams]
    return final_tokens


def gram_freq(df,gram_type,gram,col,i,label):
    body_text = " ".join(df[df.label == i][col].sample(200).values)
    toks = gram_analysis(body_text, gram)
    tok_freq = pd.DataFrame(data=[toks, np.ones(len(toks))]).T.groupby(0).sum().reset_index()
    tok_freq.columns = ['token','frequency']
    tok_freq = tok_freq.sort_values(by='frequency',ascending=False)
    
    #plt.figure(figsize=(15,8))
    plt.figure(figsize=(10,15))
    plt.title("{0} for most common tokens of {1} type".format(gram_type, label), fontsize = 24)
    #sns.barplot(x='token', y='frequency', data=tok_freq.iloc[:30])
    #plt.xticks(rotation=90)
    sns.barplot(x='frequency', y='token', data=tok_freq.iloc[:30])
    plt.show()
    
    return 

In [None]:
#%%time

labels = np.unique(train_df.label.values)
label_names = ['Bug','Feature','Question']

for gram_type, n_gram in tqdm(zip(('Bi-Gram','Tri-Gram', 'Penta-Gram'),(2,3,5))):
    for i in labels:
        gram_freq(train_df,gram_type,n_gram,'text',i,label_names[i])
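
The `zip(*[tokens[i:] for i in range(gram)])` idiom in `gram_analysis` can be hard to read at first. As a small stand-alone sketch on a hypothetical sentence, it zips `n` shifted copies of the token list so that each tuple is one n-gram. `Counter` is used here in place of the DataFrame `groupby` purely to keep the example short.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) from a token list."""
    # zip over n shifted views of the list: tokens[0:], tokens[1:], ..., tokens[n-1:]
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

# Hypothetical toy sentence, not taken from the dataset.
tokens = "app crashes when app crashes on start".split()

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(1))
```

Because `zip` stops at the shortest input, a text with fewer than `n` tokens simply yields no n-grams.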

## Plot the Most Common Words Count for each label categories (UnProcessed Data)

In [None]:
#Creating a generic function for plotting 50 most common words for each label category

def plot_common_words(df,col,i,label):
    
    data = df[df.label == i][col].values
    data_string = ' '.join(map(str,data))
    corpus = word_tokenize(data_string)
    
    freq_words = FreqDist(corpus)
    
    most_common = freq_words.most_common(50)
       
    """
    words = []
    count = []
    for w, c in most_common:
        if w not in stop_words:
            words.append(w)
            count.append(c)
    
    plt.figure(figsize=(22,6))
    sns.barplot(y=count, x=words)
    plt.title('50 Most common words for {}'.format(label), fontsize = 24)
    plt.show()
    """
    # plotly graphs are more readable and interactive
    
    fig = px.bar(pd.DataFrame(most_common, columns=['Words','Count']), x = "Words", y = "Count", title='50 Most common words for {}'.format(label),width=1200, height=700)
    fig.show()  
    return

In [None]:
#%%time

labels = np.unique(train_df.label.values)
label_names = ['Bug','Feature','Question']
for i in tqdm(labels):
    plot_common_words(train_df,'text',i,label_names[i])

# Data Preprocessing

#### Closer look at the text data

In [None]:
pd.DataFrame(train_df.text.value_counts())

## Text Pre-processing  -- Cleaning Redundant Data

As observed above, the 'text' feature contains many redundant entities such as punctuation, stop words, URLs and HTML tags. 
We need to clean such data before we proceed with word embeddings and vector transformations. 
The following steps should clean the text sufficiently and remove the redundancies.

1. Remove HTML tags
2. Remove URLs
3. Remove emojis
4. Remove stop words
5. Remove punctuation
6. Expand contractions
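The order of these steps matters: URLs and HTML tags have to be removed while their delimiters (`://`, `<`, `>`) are still present, that is, before punctuation is stripped. Below is a minimal sketch of such a pipeline using only the standard library. The tag regex is a crude stand-in for an HTML parser, and `TOY_STOP_WORDS` is a hypothetical set, not spaCy's list.

```python
import re

URL_RE = re.compile(r"http\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")            # crude stand-in for an HTML parser
NON_ALNUM_RE = re.compile(r"[^a-zA-Z0-9\s]+")
SPACES_RE = re.compile(r"\s+")

# Tiny illustrative stop-word set; the notebook uses spaCy's STOP_WORDS instead.
TOY_STOP_WORDS = {"the", "is", "at", "a", "on", "see"}

def clean(text):
    text = URL_RE.sub(" ", text)           # URLs first, while '://' and '.' are intact
    text = TAG_RE.sub(" ", text)           # then HTML tags, while '<' and '>' are intact
    text = NON_ALNUM_RE.sub(" ", text)     # then remaining punctuation and symbols
    text = SPACES_RE.sub(" ", text).strip().lower()
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)

raw = "<p>The build is broken!</p> See https://example.com/log?id=7 :("
print(clean(raw))
```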

### Check the Stop Words list

In [None]:
print("Total count of standard stop words list from SpaCy :",len(STOP_WORDS))
print("\nStandard stop words list from SpaCy :\n", STOP_WORDS)

In [None]:
#add some redundant words like 'elif' in the stop words list
STOP_WORDS = STOP_WORDS + ['elif']
print('elif' in STOP_WORDS)

In [None]:
#Discard negative words like 'not'and 'no' from this list

#stop_words.remove('not')
#stop_words.discard('no')
#print(len(stop_words))

In [None]:
# Creating a single generic function for text cleaning (this could also be written as a class)

def text_cleaning(df,col,clean_col):
    
    #cleaning abbreviated words 
    def remove_contractions(data):
        data = re.sub(r"he's", "he is", data)
        data = re.sub(r"there's", "there is", data)
        data = re.sub(r"We're", "We are", data)
        data = re.sub(r"That's", "That is", data)
        data = re.sub(r"won't", "will not", data)
        data = re.sub(r"they're", "they are", data)
        data = re.sub(r"Can't", "Cannot", data)
        data = re.sub(r"wasn't", "was not", data)
        data = re.sub(r"don\x89Ûªt", "do not", data)
        data= re.sub(r"aren't", "are not", data)
        data = re.sub(r"isn't", "is not", data)
        data = re.sub(r"What's", "What is", data)
        data = re.sub(r"haven't", "have not", data)
        data = re.sub(r"hasn't", "has not", data)
        data = re.sub(r"There's", "There is", data)
        data = re.sub(r"He's", "He is", data)
        data = re.sub(r"It's", "It is", data)
        data = re.sub(r"You're", "You are", data)
        data = re.sub(r"I'M", "I am", data)
        data = re.sub(r"shouldn't", "should not", data)
        data = re.sub(r"wouldn't", "would not", data)
        data = re.sub(r"i'm", "I am", data)
        data = re.sub(r"I\x89Ûªm", "I am", data)
        data = re.sub(r"I'm", "I am", data)
        data = re.sub(r"Isn't", "is not", data)
        data = re.sub(r"Here's", "Here is", data)
        data = re.sub(r"you've", "you have", data)
        data = re.sub(r"you\x89Ûªve", "you have", data)
        data = re.sub(r"we're", "we are", data)
        data = re.sub(r"what's", "what is", data)
        data = re.sub(r"couldn't", "could not", data)
        data = re.sub(r"we've", "we have", data)
        data = re.sub(r"it\x89Ûªs", "it is", data)
        data = re.sub(r"doesn\x89Ûªt", "does not", data)
        data = re.sub(r"It\x89Ûªs", "It is", data)
        data = re.sub(r"Here\x89Ûªs", "Here is", data)
        data = re.sub(r"who's", "who is", data)
        data = re.sub(r"I\x89Ûªve", "I have", data)
        data = re.sub(r"y'all", "you all", data)
        data = re.sub(r"can\x89Ûªt", "cannot", data)
        data = re.sub(r"would've", "would have", data)
        data = re.sub(r"it'll", "it will", data)
        data = re.sub(r"we'll", "we will", data)
        data = re.sub(r"wouldn\x89Ûªt", "would not", data)
        data = re.sub(r"We've", "We have", data)
        data = re.sub(r"he'll", "he will", data)
        data = re.sub(r"Y'all", "You all", data)
        data = re.sub(r"Weren't", "Were not", data)
        data = re.sub(r"Didn't", "Did not", data)
        data = re.sub(r"they'll", "they will", data)
        data = re.sub(r"they'd", "they would", data)
        data = re.sub(r"DON'T", "DO NOT", data)
        data = re.sub(r"That\x89Ûªs", "That is", data)
        data = re.sub(r"they've", "they have", data)
        data = re.sub(r"i'd", "I would", data)
        data = re.sub(r"should've", "should have", data)
        data = re.sub(r"You\x89Ûªre", "You are", data)
        data = re.sub(r"where's", "where is", data)
        data = re.sub(r"Don\x89Ûªt", "Do not", data)
        data = re.sub(r"we'd", "we would", data)
        data = re.sub(r"i'll", "I will", data)
        data = re.sub(r"weren't", "were not", data)
        data = re.sub(r"They're", "They are", data)
        data = re.sub(r"Can\x89Ûªt", "Cannot", data)
        data = re.sub(r"you\x89Ûªll", "you will", data)
        data = re.sub(r"I\x89Ûªd", "I would", data)
        data = re.sub(r"let's", "let us", data)
        data = re.sub(r"it's", "it is", data)
        data = re.sub(r"can't", "cannot", data)
        data = re.sub(r"don't", "do not", data)
        data = re.sub(r"you're", "you are", data)
        data = re.sub(r"i've", "I have", data)
        data = re.sub(r"that's", "that is", data)
        data = re.sub(r"i'll", "I will", data)
        data = re.sub(r"doesn't", "does not",data)
        data = re.sub(r"i'd", "I would", data)
        data = re.sub(r"didn't", "did not", data)
        data = re.sub(r"ain't", "am not", data)
        data = re.sub(r"you'll", "you will", data)
        data = re.sub(r"I've", "I have", data)
        data = re.sub(r"Don't", "do not", data)
        data = re.sub(r"I'll", "I will", data)
        data = re.sub(r"I'd", "I would", data)
        data = re.sub(r"Let's", "Let us", data)
        data = re.sub(r"you'd", "You would", data)
        data = re.sub(r"It's", "It is", data)
        data = re.sub(r"Ain't", "am not", data)
        data = re.sub(r"Haven't", "Have not", data)
        data = re.sub(r"Could've", "Could have", data)
        data = re.sub(r"youve", "you have", data)  
        data = re.sub(r"donå«t", "do not", data)
        
        return data
    
    
    #cleaning Urls
    def remove_urls(data):
        clean_url_regex = re.compile(r"http\S+|www\.\S+")
        data = clean_url_regex.sub(r"", data)
        return data
    
    
    #cleaning noisy data
    def remove_noisy_char(data):
        data = data.replace("\\r", "").strip()
        return data
    
    
    #cleaning HTML tags
    def remove_HTML_tags(data):
        soup = BeautifulSoup(data, 'html.parser') 
        return soup.get_text()
        
        
    #cleaning emojis   
    def remove_emojis(data):
        emoji_clean= re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
        
        data = emoji_clean.sub(r"",data)
        return data
    
    
    #cleaning unicode characters
    """
    def remove_unicode_chars(data):
        data = (data.encode('ascii', 'ignore')).decode("utf-8")
        return data
    """
    
    
    #cleaning punctuations
    def remove_punctuations(data):
        #clean_punct_regex = re.compile(r"[^\w\s\d]+")
        clean_punct_regex = re.compile(r"[^a-zA-Z0-9\s]+")
        data = clean_punct_regex.sub(r" ", data)
                        
        #credits - https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string
        #data = data.translate(str.maketrans('', '', string.punctuation))   
        return data
    
    
    #cleaning numeric characters
    def remove_numerics(data):
        #clean_num_regex = re.compile(r"[^A-Za-z]+")
        #data = clean_num_regex.sub(r" ", data)
        #clean_alphanum_regex = re.compile(r"\S*\d\S*")
        #data = clean_alphanum_regex.sub(r"", data)
        
        clean_num_regex = re.compile(r"\b[0-9]+\b")
        data = clean_num_regex.sub(r"", data)
        return data
    
    def remove_single_chars(data):
        #credits - https://stackoverflow.com/questions/42066352/python-regex-to-replace-all-single-word-characters-in-string
        clean_single_len_regex = re.compile(r"\b[a-zA-Z]\b")
        data = clean_single_len_regex.sub(r"", data)
        return data
    
    
    #cleaning unwanted whitespaces
    def remove_redundant_whiteSpaces(data):
        clean_redundant_whitespaces_regex = re.compile(r"\s\s+") #check for more consecutive spaces
        data = clean_redundant_whitespaces_regex.sub(r" ", data) #replace with single space
        return data
    
    
    #cleaning stopwords except 'not'
    def remove_stopwords(data):
        data = ' '.join(word.lower() for word in data.split() if word.lower() != 'not' if word.lower() not in STOP_WORDS)
        data = data.strip()
        return data
    
    
    #cleaning long length words (greater than 30 chars)
    def remove_long_length_tokens(data):
        data = ' '.join(word.lower() for word in data.split() if len(word) <= 30)
        data = data.strip()
        return data
    
    
    df[clean_col]= df[col].apply(remove_contractions)
    df[clean_col]= df[clean_col].apply(remove_urls)
    df[clean_col]= df[clean_col].apply(remove_noisy_char)
    df[clean_col]= df[clean_col].apply(remove_HTML_tags)
    df[clean_col]= df[clean_col].apply(remove_emojis)
    #df[clean_col]= df[clean_col].apply(remove_unicode_chars)
    df[clean_col]= df[clean_col].apply(remove_punctuations)
    df[clean_col]= df[clean_col].apply(remove_numerics)
    df[clean_col]= df[clean_col].apply(remove_single_chars)
    df[clean_col]= df[clean_col].apply(remove_redundant_whiteSpaces)
    df[clean_col]= df[clean_col].apply(remove_stopwords)
    df[clean_col]= df[clean_col].apply(remove_long_length_tokens)
    
    
#     df[col]= df[col].apply(lambda text: remove_contractions(text))
#     df[col]= df[col].apply(lambda text: remove_urls(text))
#     df[col]= df[col].apply(lambda text: remove_noisy_char(text))
#     df[col]= df[col].apply(lambda text: remove_HTML_tags(text))
#     df[col]= df[col].apply(lambda text: remove_emojis(text))
#     #df[col]= df[col].apply(lambda text: remove_unicode_chars(text))
#     df[col]= df[col].apply(lambda text: remove_punctuations(text))
#     df[col]= df[col].apply(lambda text: remove_numerics(text))
#     df[col]= df[col].apply(lambda text: remove_single_chars(text))
#     df[col]= df[col].apply(lambda text: remove_redundant_whiteSpaces(text))
#     df[col]= df[col].apply(lambda text: remove_stopwords(text))
#     df[col]= df[col].apply(lambda text: remove_long_length_tokens(text))

    
    return df
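
The long chain of `re.sub` calls in `remove_contractions` could alternatively be driven by a lookup table and one compiled pattern. The following is only a sketch of that idea with a small, hypothetical subset of the table. It handles straight apostrophes only and does not cover the mis-encoded variants the function above also targets.

```python
import re

# Small illustrative subset of the contraction table (keys are lower-case).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation, longest keys first, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def _replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital letter: "Don't" -> "Do not".
        return expanded.capitalize() if word[0].isupper() else expanded
    return _PATTERN.sub(_replace, text)

print(expand_contractions("Don't panic, it's fine but I can't reproduce it."))
```

A table makes duplicates and case variants easier to spot than a sequence of individual substitutions, at the cost of slightly different behaviour for all-caps input.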

In [None]:
# time.clock() was removed in Python 3.8; time.perf_counter() is the replacement
start_time = time.perf_counter()
train_df = text_cleaning(train_df, 'text', 'clean_text')
print("time required for text_cleaning :", time.perf_counter() - start_time, "sec.")

In [None]:
i = 6
#check the text data before text-preprocessing
print("text data at index {0} before text pre-processing : \n\n {1}".format(i, train_df.text.iloc[i]))
print("\n\n")

#check the cleaned text data post text-preprocessing
print("text data at index {0} post text pre-processing : \n\n {1}".format(i,train_df.clean_text.iloc[i]))

## Closely analysing Body text post cleanup

In [None]:
# check the sample bug data with most number of punctuations post cleanup

print('Bug: \n Sample Bug data (with most number of punctuations = {}) post text cleanup. \n'.format(count_text_punctuations_Bug.max()))
print(train_df.loc[count_text_punctuations_Bug.idxmax()]['clean_text'])  # idxmax returns an index label, so use .loc

In [None]:
#check sample feature data with most number of punctuations post cleanup

print('Feature: \n Sample Feature data (with most number of punctuations = {}) post text cleanup. \n'.format(count_text_punctuations_Feature.max()))
print(train_df.loc[count_text_punctuations_Feature.idxmax()]['clean_text'])  # idxmax returns an index label, so use .loc

In [None]:
#check sample question data with most number of punctuations post cleanup

print('Question: \n Sample Question data (with most number of punctuations = {}) post text cleanup. \n'.format(count_text_punctuations_Question.max()))
print(train_df.loc[count_text_punctuations_Question.idxmax()]['clean_text'])  # idxmax returns an index label, so use .loc

#### Inference:

- We can also observe repetition of words in certain data records after cleaning.
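If such back-to-back repetition turns out to hurt downstream models, one option is to collapse consecutive duplicate tokens. This step is not applied in this notebook. The sketch below only illustrates the idea on a hypothetical string.

```python
from itertools import groupby

def collapse_repeats(text):
    """Collapse runs of the same token repeated back-to-back into one token."""
    return " ".join(token for token, _run in groupby(text.split()))

# Hypothetical cleaned text with repeated tokens, e.g. from a pasted log or stack trace.
sample = "error error error loading module module config error"
print(collapse_repeats(sample))
```

Only adjacent repeats are merged, so a word that reappears later in the text is kept.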

In [None]:
## Identifying long words (length > 20 chars) in 'Text' data post cleaning

# text_data = train_df.clean_text.values
# text_data

# # create a corpus of long words having length > 20 chars

# long_length_word_corpus = [[w for w in txt.split() if len(w)>20] for txt in text_data]
# long_length_word_corpus = list(filter(None,long_length_word_corpus))
# long_length_word_corpus = list(itertools.chain.from_iterable(long_length_word_corpus))
# long_length_word_corpus = list(set(long_length_word_corpus))
# long_length_word_corpus
# print("Total number of unique longest length words = {}".format(len(long_length_word_corpus)))

## Plotting WordClouds for cleaned 'Text' data

In [None]:
##%%time

labels = np.unique(train_df.label.values)
label_names = ['Bug','Feature','Question']
for i in tqdm(labels):
    generate_wordcloud(train_df,'clean_text',i,label_names[i])

## Gram Analysis on the cleaned 'Text' data

In [None]:
##%%time

labels = np.unique(train_df.label.values)
label_names = ['Bug','Feature','Question']

for gram_type, n_gram in tqdm(zip(('Bi-Gram','Tri-Gram','Penta-Gram'),(2,3,5))):
    for i in labels:
        gram_freq(train_df,gram_type,n_gram,'clean_text',i,label_names[i])

## Plot the Most Common Words Count for each label categories on the cleaned 'Text' data

In [None]:
##%%time

labels = np.unique(train_df.label.values)
label_names = ['Bug','Feature','Question']
for i in tqdm(labels):
    plot_common_words(train_df,'clean_text',i,label_names[i])

### Inference:
- All the noisy data and punctuation characters have been removed from the 'text' data.
- Text preprocessing has resulted in some unusually long features.
- Overall the data looks well cleaned up now and ready for the next phase of word embeddings and model building.

## Lemmatization & Stemming :

- So far we have cleaned the data and removed the redundant text and noise from it. This reduced the dimensionality of the data to a certain extent.
- Next, we will reduce some words to their roots, which further shrinks the vocabulary.
- Here we apply lemmatization, which reduces words to their morphological roots and so retains the semantics of the text.
- We could also apply stemming, but a stemmer does not retain the semantic meaning of words and often reduces them to non-dictionary roots.
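To make the difference concrete, here is a deliberately simplified toy comparison. `toy_lemmatize` is a hypothetical dictionary lookup and `toy_stem` a crude suffix stripper. Neither is the WordNet lemmatizer or the Porter algorithm used below, but they show why stems need not be dictionary words.

```python
# Toy lookup table standing in for a dictionary-based lemmatizer (hypothetical entries).
TOY_LEMMAS = {"studies": "study", "libraries": "library", "crashes": "crash"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return TOY_LEMMAS.get(word, word)

def toy_stem(word):
    # Crude suffix stripping, loosely in the spirit of a rule-based stemmer;
    # this is NOT the Porter algorithm.
    for suffix, replacement in (("ies", "i"), ("es", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ("studies", "libraries", "crashes"):
    print(w, "->", toy_lemmatize(w), "|", toy_stem(w))
```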

In [None]:
#lemmatizing the text data
def lemmatize_corpus(data, method = 'wordnet'):
    if method == 'spacy':
        out_data = " ".join([token.lemma_ for token in nlp_spcy(data)])
    else:
        lemmatizer=WordNetLemmatizer()
        out_data = ' '.join(lemmatizer.lemmatize(word) for word in data.split())

    # return for both branches (previously the 'spacy' branch returned None)
    return out_data


#stemming the text data
def stem_traincorpus(data):
    pstemmer = PorterStemmer()
    out_data = ' '.join(pstemmer.stem(word) for word in data.split())
    return out_data 

In [None]:
#check the cleaned text data before lemmatization/stemming
i = 4

print("text data at index {0} before text lemmatization/stemming : \n\n {1}".format(i,train_df.clean_text.iloc[i]))

In [None]:
#apply lemmatization
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
train_df['lemmatized_text'] = train_df['clean_text'].apply(lemmatize_corpus, method='wordnet')
print("time required for lemmatizing text :", time.perf_counter() - start_time, "sec.")
print("\n")

In [None]:
#check the cleaned text data post Lemmatization
print("text data at index {0} post text Lemmatization : \n\n {1}".format(i,train_df.lemmatized_text.iloc[i]))

In [None]:
#apply stemming (PorterStemmer is already imported at the top of the notebook)
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
train_df['stemmed_text'] = train_df['clean_text'].apply(stem_traincorpus)
print("time required for stemming text :", time.perf_counter() - start_time, "sec.")
print("\n")

In [None]:
#check the cleaned text data post Stemming
print("text data at index {0} post text Stemming : \n\n {1}".format(i,train_df.stemmed_text.iloc[i]))

In [None]:
# Check overall train_data

train_df.head(5)

## End Notes :

- In this notebook, I have tried to uncover some hidden insights in the text data by carrying out an in-depth exploratory data analysis (EDA).
- Next, I will implement the word embeddings and statistical model building.
- You can find the 2nd part of the GitHub Bug Prediction problem series [here](https://www.kaggle.com/gauravharamkar/github-static-semantic-word-mbeddings).