# This notebook does some basic exploration of the training data. The main purpose is to demonstrate a few simple Pandas features: assign(), eval(), and corr(). Since this dataset has few feature columns, there isn't much more to do here.

## Basic imports we'll need

In [None]:
import re

import pandas as pd

## Read in the train data as a Pandas DataFrame and then find some basic info. To run this notebook, you'll need the train.csv file in the ../input directory

In [None]:
train_data = pd.read_csv('../input/train.csv')

In [None]:
train_data.info()

In [None]:
train_data[0:5]

In [None]:
label_columns = [x for x in train_data.columns if x not in ['id', 'comment_text']]

## Let's check out the distribution of the labels

In [None]:
# Share of rows carrying each label; the row count comes from the frame itself
for label in label_columns:
    print(f"Fraction for {label}: {train_data[label].sum() / len(train_data)}")
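Assuming each label column holds only 0/1 values, the share of positive rows is just the column mean, so all labels can be summarized in one call. A minimal sketch on a made-up frame (the `toy` data and the label names `toxic` and `insult` are illustrative and not taken from train.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for train_data; all values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "comment_text": ["a", "b", "c", "d"],
    "toxic": [1, 0, 0, 1],
    "insult": [0, 0, 0, 1],
})
toy_labels = [c for c in toy.columns if c not in ["id", "comment_text"]]

# The mean of a 0/1 column equals the fraction of rows carrying that label.
label_fractions = toy[toy_labels].mean()
print(label_fractions)
```

The result is a Series indexed by label name, which is also convenient for sorting or plotting later.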

## Define a function to apply to the comment_text column to strip punctuation.

In [None]:
def remove_punctuation(row_str):
    # \W matches anything that is not a letter, digit, or underscore
    return re.sub(r"\W", " ", row_str)
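To see what this pattern actually does, here is a small check on a made-up string (the `sample` text is purely illustrative). Two things are worth noticing: underscores survive, because `\W` treats them as word characters, and each removed character is replaced by a space, so the string length does not change.

```python
import re

def remove_punctuation(row_str):
    # Same pattern as above: every non-word character becomes a single space.
    return re.sub(r"\W", " ", row_str)

# Made-up sample containing punctuation, a newline, and an underscore.
sample = "Hello, world!\nsnake_case stays"
cleaned = remove_punctuation(sample)

print(repr(cleaned))
print(len(sample), len(cleaned))
```

Since the length is preserved, any character count taken after cleaning still includes the positions where punctuation used to be.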

## Now apply this function to comment_text and observe the result

In [None]:
train_data = train_data.assign(comment_text=train_data.comment_text.apply(remove_punctuation))

In [None]:
train_data[0:10]

## Create a new column that stores the lengths of the comment_text column

In [None]:
train_data = train_data.assign(comment_len=train_data.comment_text.str.len())

## Let's explore the distribution of lengths of comments.

In [None]:
deciles = [x/10.0 for x in range(1, 10)]
train_data.comment_len.describe(percentiles=deciles)

## Could there be a relationship between the length of a comment and its label?

In [None]:
for label in label_columns:
    print("Correlation with comment length for {}: {}".format(label, train_data[label].corr(train_data.comment_len)))
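The loop above can also be collapsed into a single call with `DataFrame.corrwith`, which returns one Pearson coefficient per column. A rough sketch on made-up numbers (the `toy` frame and its label names are hypothetical stand-ins for train_data):

```python
import pandas as pd

# Hypothetical toy data; label names and values are made up for illustration.
toy = pd.DataFrame({
    "toxic": [1, 0, 0, 1, 0],
    "insult": [0, 0, 1, 1, 0],
    "comment_len": [12, 80, 45, 9, 60],
})
toy_labels = ["toxic", "insult"]

# One Pearson coefficient per label column, returned as a Series.
length_corr = toy[toy_labels].corrwith(toy["comment_len"])
print(length_corr)
```

Each entry matches what `toy[label].corr(toy["comment_len"])` would give for that label, so this is a convenience rather than a different statistic.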

## What if we considered the number of words instead of characters?

## There are two ways we can do this: a total word count and a count of unique words. We'll start with the total word count

### Define a function to find the number of words in comment_text

In [None]:
def get_num_words(row_str):
    return len(row_str.split())

### Create the new column

In [None]:
train_data = train_data.assign(num_words=train_data.comment_text.apply(get_num_words))

### And look at the distribution of word counts

In [None]:
train_data.num_words.describe(percentiles=deciles)

### Now do the same thing with number of unique words

### First, let's define a function to apply to the comment_text column to calculate the number of unique words

In [None]:
def get_unique_words(row_str):
    return len(set(row_str.lower().split()))
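A quick comparison of the two counting functions on a made-up sentence (the `sample` string is illustrative only) shows how they differ: repeated words count once in the unique version, and lowercasing makes "The" and "the" the same word.

```python
def get_num_words(row_str):
    # Total number of whitespace-separated tokens.
    return len(row_str.split())

def get_unique_words(row_str):
    # Number of distinct tokens, ignoring case.
    return len(set(row_str.lower().split()))

sample = "The cat saw the other cat"
total = get_num_words(sample)      # 6 tokens in total
unique = get_unique_words(sample)  # 4 distinct: the, cat, saw, other

print(total, unique)
```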

### Now let's create that column

In [None]:
train_data = train_data.assign(unique_words=train_data.comment_text.apply(get_unique_words))

### What does the distribution of unique words look like?

In [None]:
train_data.unique_words.describe(percentiles=deciles)

### Finally, investigate the relationship between word counts and labels

In [None]:
for label in label_columns:
    print("Correlation with number of words for {}: {}".format(label, train_data[label].corr(train_data.num_words)))

In [None]:
for label in label_columns:
    print("Correlation with unique words for {}: {}".format(label, train_data[label].corr(train_data.unique_words)))

## One more thing would be to look at mean word length

In [None]:
train_data.eval('mean_word_length = comment_len/num_words', inplace=True)
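One thing to watch with this ratio: a comment made up entirely of punctuation would have `num_words` equal to 0 after cleaning, and the division would then produce `inf` instead of a usable number. Whether train.csv contains such rows is not checked here. The sketch below uses made-up values (the `toy` frame is hypothetical) to show one way of turning those results into missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the middle row mimics a comment with no words left.
toy = pd.DataFrame({
    "comment_len": [11, 3, 8],
    "num_words": [2, 0, 2],
})

# Same eval() expression as above, applied to the toy frame.
toy.eval("mean_word_length = comment_len / num_words", inplace=True)

# Treat division-by-zero results as missing rather than infinite.
toy["mean_word_length"] = toy["mean_word_length"].replace(
    [np.inf, -np.inf], np.nan
)
print(toy)
```

Note also that `comment_len` counts every character in the cleaned string, spaces included, so this ratio runs somewhat higher than a true average word length.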

## Once again, check out the distribution of mean word length values

In [None]:
train_data.mean_word_length.describe(percentiles=deciles)

## There is an obvious outlier, given that the max mean word length is three orders of magnitude greater than the 99th percentile

In [None]:
train_data.mean_word_length.quantile(0.99)
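To look at an outlier like this directly, `idxmax()` gives the index label of the largest value, which can then be passed to `.loc`. A small sketch with made-up rows (the `toy` frame and its values are hypothetical, not from train.csv):

```python
import pandas as pd

# Hypothetical toy data; the second row plays the role of the outlier.
toy = pd.DataFrame({
    "comment_text": ["short words here", "x" * 50, "two words"],
    "mean_word_length": [5.3, 50.0, 4.5],
})

# Index label of the row with the largest mean word length.
outlier_idx = toy["mean_word_length"].idxmax()
outlier_row = toy.loc[outlier_idx]

print(outlier_idx)
print(outlier_row["comment_text"])
```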

## Any possible correlations between mean word length and label?

In [None]:
for label in label_columns:
    print("Correlation with mean word length for {}: {}".format(label, train_data[label].corr(train_data.mean_word_length)))

# In summary, there weren't any obvious connections between these basic string metrics and the labels. Using some real NLP techniques such as POS tagging, semantic analysis, and removal of stop words could yield more interesting results.