<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cleaning-Notebook" data-toc-modified-id="Cleaning-Notebook-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cleaning Notebook</a></span><ul class="toc-item"><li><span><a href="#Importing-libarires" data-toc-modified-id="Importing-libarires-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Importing libraries</a></span></li><li><span><a href="#Reading-in-the-data" data-toc-modified-id="Reading-in-the-data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Reading in the data</a></span></li><li><span><a href="#Sorting-the-Data" data-toc-modified-id="Sorting-the-Data-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Sorting the Data</a></span><ul class="toc-item"><li><span><a href="#Configuring-Data-Types" data-toc-modified-id="Configuring-Data-Types-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Configuring Data Types</a></span></li></ul></li><li><span><a href="#Handling-Nulls:" data-toc-modified-id="Handling-Nulls:-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Handling Nulls:</a></span></li><li><span><a href="#Natural-Language-Processing-Cleaning" data-toc-modified-id="Natural-Language-Processing-Cleaning-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Natural Language Processing Cleaning</a></span></li><li><span><a href="#Exporting-Data" data-toc-modified-id="Exporting-Data-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Exporting Data</a></span></li></ul></li></ul></div>

# Cleaning Notebook

Before conducting EDA and building models, the data needs to be formatted so that it won't throw errors. Based on a combination of initial exploration and later EDA, the following steps are needed:

1. Creating Columns 
2. Converting to Appropriate Datatype
3. Handling Nulls
4. Cleaning Text
5. Dealing with Duplicate Values Left Over from Merging Madness

## Importing libraries

For this notebook we'll use pandas, the `regex` package (for cleaning text), and seaborn and matplotlib (for visualizations).

In [1]:
import pandas as pd
import regex as re
import seaborn as sns
import matplotlib.pyplot as plt

# from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from nltk.tokenize import RegexpTokenizer
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# from sklearn.feature_extraction import stop_words, text

# This creates HD resolution for visualizations

%config InlineBackend.figure_format = 'retina'

## Reading in the data

An initial look at the dataframe reveals that the text for `answers_body` has several HTML artifacts. In the NLP cleaning later we'll remove these and also clean the other text columns as a precaution.

In [2]:
data = pd.read_csv('./Datasets/df_ques_with_ans_tag_indicator.csv')
print("shape:", data.shape)
data.head(2)

shape: (51944, 7)


| | questions_id | questions_author_id | questions_date_added | questions_title | questions_body | was_answered | has_tag |
|---|---|---|---|---|---|---|---|
| 0 | 332a511f1569444485cf7a7a556a5e54 | 8f6f374ffd834d258ab69d376dd998f5 | 2016-04-26 11:14:26 UTC+0000 | Teacher career question | What is a maths teacher? what is a ma... | 1 | 1.0 |
| 1 | eb80205482e4424cad8f16bc25aa2d9c | acccbda28edd4362ab03fb8b6fd2d67b | 2016-05-20 16:48:25 UTC+0000 | I want to become an army officer. What can I d... | I am Priyanka from Bangalore . Now am in 10th ... | 1 | 1.0 |


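`data.describe()` gives a quick overview of the numeric columns. Both are indicator columns whose values fall within 0 to 1, so there are no outliers that need to be cleaned. Note that the `has_tag` count (50244) is lower than the row count (51944), which already hints at the nulls handled later.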

In [3]:
data.describe()

| | was_answered | has_tag |
|---|---|---|
| count | 51944.0 | 50244.0 |
| mean | 0.984195 | 1.0 |
| std | 0.124724 | 0.0 |
| min | 0.0 | 1.0 |
| 25% | 1.0 | 1.0 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


## Sorting the Data

By checking the datatypes I discovered:

1. Dates are saved as *objects*; we'll change these to datetime.
2. IDs are stored as *objects*.
    a. Looking at the dataframe above, the IDs use both numbers and letters, so they should stay *objects*.
    b. That said, when doing our NLP training these should be treated differently from the other *object* columns.

In [4]:
data.dtypes

questions_id             object
questions_author_id      object
questions_date_added     object
questions_title          object
questions_body           object
was_answered              int64
has_tag                 float64
dtype: object

### Configuring Data Types

In [5]:
#### Creating a `date_cols` list so we can loop through and convert dates to the datetime data type

date_cols = [col for col in data.columns if "date" in col]

date_cols

#### Converting Date Columns to *Date* type

for col in date_cols:
    data[col] = pd.to_datetime(data[col])

data.dtypes

questions_id                         object
questions_author_id                  object
questions_date_added    datetime64[ns, UTC]
questions_title                      object
questions_body                       object
was_answered                          int64
has_tag                             float64
dtype: object
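
As a hedged illustration of the conversion above, the sketch below runs `pd.to_datetime` on a tiny made-up frame. The sample values use an ISO-style offset (`+00:00`) rather than the dataset's exact `UTC+0000` suffix, and `utc=True` and `errors="coerce"` are optional choices assumed here, not settings used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "questions_date_added": ["2016-04-26 11:14:26+00:00", "not a date"],
    "questions_title": ["a", "b"],
})

# Same name-based selection as in the notebook: any column with "date" in its name
toy_date_cols = [col for col in toy.columns if "date" in col]

for col in toy_date_cols:
    # utc=True yields a timezone-aware column; errors="coerce" turns
    # unparseable strings into NaT instead of raising an error
    toy[col] = pd.to_datetime(toy[col], utc=True, errors="coerce")

# Counting NaT values shows how many strings failed to parse
n_unparsed = int(toy["questions_date_added"].isna().sum())
```

Counting the `NaT` values afterwards is a cheap way to see whether any rows failed to parse before moving on.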

Extracting the columns of *object* type and saving their names as `str_cols`. We do this so we can easily process all the *object* types in the `cleaning_text` function.


In [6]:
str_cols = data.select_dtypes(include='object').columns
str_cols

Index(['questions_id', 'questions_author_id', 'questions_title',
       'questions_body'],
      dtype='object')

After looking at the various *object* type columns I realized that the `id` columns shouldn't be cleaned like free text. Since they include both numbers and letters, we also cannot convert them into integers. So instead let's create a list, `text_cols`, that contains all the *object* type columns excluding the `id` columns.

In [7]:
text_cols = [] #create a list

for cols in str_cols: # looping through the `str_cols` variable

    if "id" not in cols: # if `id` isn't in the name 
        text_cols.append(cols) # append to `text_cols` list

text_cols

['questions_title', 'questions_body']
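
One caveat with the substring check above: `"id" not in cols` also drops any column whose name merely contains the letters "id". A minimal sketch of a stricter suffix check follows; the `video_title` column is hypothetical and only there to show the difference.

```python
# Column names: three from this dataset's naming scheme plus one hypothetical
columns = ["questions_id", "questions_author_id", "video_title", "questions_body"]

# Substring check, as used in the notebook: "video_title" is dropped by accident
substring_match = [c for c in columns if "id" not in c]

# Suffix check: only names that end in "_id" are treated as ID columns
suffix_match = [c for c in columns if not c.endswith("_id")]
```

For the columns in this dataset both checks give the same result, so this only matters if more columns are merged in later.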

## Handling Nulls:

Many of our models and code do not work when null values are present. We need to either delete or fill in the nulls so that we won't run into errors later.

A major source of nulls in this dataset was how the data was given to us. There were 15 datasets that did not perfectly match. Initial merging of the datasets led to significant duplicate information and an abundance of null values. However, after reviewing the merging strategy it was possible to significantly reduce the number of null values.

Now, there are a couple of ways to deal with null values. We can either remove them or fill them in with simulated (imputed) data. However, simulating data may reduce statistical confidence, depending on the analysis applied. We will do our best to avoid having to simulate data in this case.

First we need to check for nulls.

In [8]:
data.isnull().sum()

questions_id               0
questions_author_id        0
questions_date_added       0
questions_title            0
questions_body             0
was_answered               0
has_tag                 1700
dtype: int64
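
The raw counts above are easier to judge next to the share of rows they represent (here 1700 of 51944 rows, roughly 3.3%). A small sketch of such a summary on made-up data; the `null_summary` name and its columns are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy frame with nulls in one column only (values are made up)
toy = pd.DataFrame({
    "questions_title": ["a", "b", "c", "d"],
    "has_tag": [1.0, None, 1.0, None],
})

# Null count and percentage of rows that are null, per column
null_summary = pd.DataFrame({
    "n_null": toy.isnull().sum(),
    "pct_null": toy.isnull().mean() * 100,
})

# Keep only the columns that actually contain nulls
null_summary = null_summary[null_summary["n_null"] > 0]
```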

All the data in the `has_tag` column were '1' when it was initially introduced, with no nulls. Since we did a left merge on the questions dataset we know all the nulls are where questions did not have tags. Therefore I'm filling `has_tag` nulls with 0.

In [9]:
# Assigning back avoids the chained in-place fill, which newer pandas versions warn about
data['has_tag'] = data['has_tag'].fillna(0)
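
To make the reasoning behind the fill concrete, here is a minimal sketch of how a left merge produces these nulls. The two toy tables are made up, and the final `astype(int)` is an optional extra that this notebook does not apply (it keeps `has_tag` as a float).

```python
import pandas as pd

# Every question appears in the left table
questions = pd.DataFrame({"questions_id": ["q1", "q2", "q3"]})

# Hypothetical tag table: only questions with at least one tag appear
tagged = pd.DataFrame({"questions_id": ["q1", "q3"], "has_tag": [1, 1]})

# A left merge keeps all questions; q2 has no match, so has_tag is NaN there
merged = questions.merge(tagged, on="questions_id", how="left")
n_null_before = int(merged["has_tag"].isnull().sum())

# The nulls mean "no tag", so 0 is a faithful fill rather than simulated data
merged["has_tag"] = merged["has_tag"].fillna(0).astype(int)
```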

## Natural Language Processing Cleaning

Since we want to explore the text data and build models from the text we need to clean the text of HTML artifacts and standardize the format to get the best results. This will include removing apostrophes, line breaks, and all punctuation so that there is strictly only text.


**Setting up a Text Cleaning Function:**

The function takes in a column and cleans the text for preprocessing. It removes HTML artifacts as well as punctuation and numbers, and converts the text to all lower case.

In [10]:
def cleaning_text(df, df_col):
    """
    df: the DataFrame to clean (modified in place)
    df_col: a column name formatted as a string, i.e. "column_name"

    This function takes a column and cleans the text for that column.
    It removes HTML artifacts, such as <p> and <br>, as well as URLs,
    punctuation and numbers to prepare the text for processing and modeling.
    In addition it makes all the text lower case.
    It utilizes the .str.replace method as well as regex.
    It returns the top 2 rows of the cleaned column.
    """

    # Uses .str.replace with literal (non-regex) matching
    df[df_col] = df[df_col].str.replace("<p>", "", regex=False)   # removes <p>
    df[df_col] = df[df_col].str.replace("</p>", "", regex=False)  # removes </p>
    df[df_col] = df[df_col].str.replace("<br>", "", regex=False)  # removes <br>
    df[df_col] = df[df_col].str.replace("\n", "", regex=False)    # removes newlines

    # Makes everything lower case
    df[df_col] = df[df_col].str.lower()

    # Using regex and lambda
    # URLs are removed first: once punctuation is stripped, the URL pattern can no longer match
    df[df_col] = df[df_col].map(lambda x: re.sub(r'https?://\S*', ' ', x))  # removing urls
    df[df_col] = df[df_col].map(lambda x: re.sub(r'//', ' ', x))  # replacing double slashes
    df[df_col] = df[df_col].map(lambda x: re.sub(r"[\[\]']", '', x))  # removing apostrophes and square brackets
    df[df_col] = df[df_col].map(lambda x: re.sub(r'[^\w\s]', ' ', x))  # replacing all punctuation
    df[df_col] = df[df_col].map(lambda x: re.sub('\xa0', ' ', x))  # replacing non-breaking spaces (xa0)

    # Replaces everything that is not a letter, including numbers, with a space
    df[df_col] = df[df_col].map(lambda x: re.sub(r"[^a-zA-Z]", " ", x))

    # Returns the top 2 rows
    return df[df_col].head(2)
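
For quick spot checks it can help to run the same kind of cleaning on a single string. The sketch below is a simplified stand-in, not the function above: `clean_one` is a hypothetical helper, it uses the standard-library `re` module, it handles URLs before punctuation, and it adds one step the notebook does not have (collapsing repeated spaces).

```python
import re

def clean_one(text):
    """Simplified, string-level version of the cleaning steps (illustrative only)."""
    # Drop the HTML artifacts and newlines
    for artifact in ("<p>", "</p>", "<br>", "\n"):
        text = text.replace(artifact, "")
    text = text.lower()
    # URLs go first, while ':' and '/' are still present to match on
    text = re.sub(r"https?://\S*", " ", text)
    # Drop apostrophes so contractions stay one word
    text = text.replace("'", "")
    # Anything that is not a letter (punctuation, digits) becomes a space
    text = re.sub(r"[^a-z]", " ", text)
    # Extra step, not in the notebook: collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

sample = "<p>I'm in 10th grade.<br>See https://example.com/page for more!</p>"
cleaned = clean_one(sample)
# cleaned == "im in th grade see for more"
```

Note how "10th" becomes "th", which matches what happens to the second row of the real data.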

**Clean the Data**

The for loop iterates through the `text_cols` list (the *object* type columns, excluding the `id` columns) and cleans the text using the `cleaning_text()` function.

In [11]:
for cols in text_cols:
    if data[cols].isnull().sum() > 0: # If there are any nulls
        # print which columns have nulls
        print(f"{cols} has null values, so we're filling with 'none', then calling `cleaning_text()` function")

        # Filling nulls with 'none' (assigning back instead of a chained in-place fill)
        data[cols] = data[cols].fillna('none')

    # Call `cleaning_text` function, with or without the fill above
    cleaning_text(data, cols)

In [12]:
data.head(2)

| | questions_id | questions_author_id | questions_date_added | questions_title | questions_body | was_answered | has_tag |
|---|---|---|---|---|---|---|---|
| 0 | 332a511f1569444485cf7a7a556a5e54 | 8f6f374ffd834d258ab69d376dd998f5 | 2016-04-26 11:14:26+00:00 | teacher career question | what is a maths teacher what is a ma... | 1 | 1.0 |
| 1 | eb80205482e4424cad8f16bc25aa2d9c | acccbda28edd4362ab03fb8b6fd2d67b | 2016-05-20 16:48:25+00:00 | i want to become an army officer what can i d... | i am priyanka from bangalore now am in th ... | 1 | 1.0 |


## Exporting Data

In [13]:
data.to_csv('./Datasets/cleaned_4_modeling.csv', index=False)
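
One thing to keep in mind when this file is read back in: CSV does not store dtypes, so the date column comes back as plain text. A minimal sketch of the round trip on a made-up row, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Toy frame with a timezone-aware date column (values are made up)
toy = pd.DataFrame({
    "questions_date_added": pd.to_datetime(["2016-04-26 11:14:26+00:00"], utc=True),
    "has_tag": [1.0],
})

# Write to an in-memory buffer, the same way to_csv writes to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reloading, the date column is text again
reloaded = pd.read_csv(buffer)
dtype_after_reload = reloaded["questions_date_added"].dtype

# Converting it back restores the timezone-aware datetime
reloaded["questions_date_added"] = pd.to_datetime(
    reloaded["questions_date_added"], utc=True
)
```

The modeling notebook would need a conversion like this if it relies on the date column being a datetime.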