# EDA for Contradictory, My Dear Watson
Kaggle Challenge: https://www.kaggle.com/c/contradictory-my-dear-watson/notebooks

Prettier Plots: https://bmanohar16.github.io/blog/customizing-bar-plots

![Sherlock Holmes](https://i.ibb.co/k30XzxD/kisspng-john-h-watson-sherlock-holmes-vector-graphics-cli-sherlock-homes-cliparts-free-download-clip.png)

### View files that are part of the challenge

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


import matplotlib.pyplot as plt

### Load DataFrames
Load the training and test sets into DataFrames and print the number of records in each.

In [None]:
train_df = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
print('The Training data has {} records'.format(train_df.shape[0]))
train_df.head()

In [None]:
test_df = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')
print("The Test data has {} records".format(test_df.shape[0]))
display(test_df.head())

### Make sure the data has no missing values
Check every column of both data sets for nulls before going any further.

In [None]:
train_df.isnull().any()

In [None]:
test_df.isnull().any()
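
`isnull().any()` only says whether a column has a null, not how many. The sketch below counts nulls per column instead. It uses a small made-up frame (`toy_df` is a hypothetical stand-in, not the competition data), so it runs on its own.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "premise": ["A man is walking.", None, "It rained all day."],
    "hypothesis": ["A person moves.", "Nobody came.", "The weather was dry."],
})

# Number of missing values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# True only when the whole frame is free of nulls.
has_no_nulls = not toy_df.isnull().values.any()
print(has_no_nulls)
```

The same two calls should work unchanged on `train_df` and `test_df`.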

### Languages
The data set contains multiple languages. How often does each one appear?

In [None]:
train_df['language'].value_counts()
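
Raw counts are hard to compare between data sets of different sizes, so percentages can be handier. A minimal sketch, using a made-up `toy_languages` Series in place of `train_df['language']`:

```python
import pandas as pd

# Toy language column; the values are made up for illustration.
toy_languages = pd.Series(["English", "English", "French", "Thai"], name="language")

# normalize=True returns fractions instead of counts; scale them to percentages.
language_share = toy_languages.value_counts(normalize=True).mul(100).round(1)
print(language_share)
```

Swapping in `train_df['language']` or `test_df['language']` should give the share of each language in the real data.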

### Example Premise and Hypothesis

In [None]:
english = train_df[train_df['language'] == 'English'].copy()

print('Premise: {}'.format(english.iloc[1]['premise']))
print('Hypothesis: {}'.format(english.iloc[1]['hypothesis']))


# Graphs for Language Distribution
Plot the language counts as horizontal bar charts to see the distribution at a glance.

In [None]:
def drawGraph(x, y, title):
    
    # Figure Size
    fig, ax = plt.subplots(figsize=(15,10))

    # Horizontal Bar Plot
    ax.barh(x, y, color='crimson')

    # Remove axes spines
    for s in ['top','bottom','left','right']:
        ax.spines[s].set_visible(False)

    # Remove x,y Ticks
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')

    # Add padding between axes and labels
    ax.xaxis.set_tick_params(pad=5)
    ax.yaxis.set_tick_params(pad=10)

    # Add x,y gridlines
    # Pass the flag positionally: the `b` keyword was removed in newer matplotlib releases
    ax.grid(True, color='grey', linestyle='-.', linewidth=0.5, alpha=0.2)

    # Show top values 
    ax.invert_yaxis()

    # Add Plot Title
    ax.set_title(title,
                 loc='left', pad=10)

    # Add annotation to bars
    for i in ax.patches:
        ax.text(i.get_width()+100, i.get_y()+0.5, str(round((i.get_width()), 2)),
                fontsize=10, fontweight='bold', color='grey')

    # Add Text watermark
    fig.text(0.9, 0.15, '@introvertedspud', fontsize=12, color='grey',
             ha='right', va='bottom', alpha=0.5)

    # Save Plot as image, named after the title so each plot gets its own file
    fig.savefig('{}.png'.format(title), dpi=100,
                bbox_inches='tight')

    # Show Plot
    plt.show()

In [None]:
drawGraph(train_df['language'].value_counts().index, train_df['language'].value_counts(), 'Language Distribution for Train Data')

In [None]:
drawGraph(test_df['language'].value_counts().index, test_df['language'].value_counts(), 'Language Distribution for Test Data')

# Graph Training Data Label Distribution
How often does each label appear in the training data?

In [None]:
# Convert the numeric labels to strings so the bar plot treats them as categories
uniques = train_df['label'].value_counts().index
uniqueString = [str(label) for label in uniques]

drawGraph(uniqueString, train_df['label'].value_counts(), 'Label Distribution for Training Data')
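
The bare numbers on the axis are not very readable. The sketch below maps them to names before counting. The mapping (0 = entailment, 1 = neutral, 2 = contradiction) is my assumption about the competition's encoding, so check the data description before relying on it. `LABEL_NAMES` and `toy_labels` are hypothetical names used only for this example.

```python
import pandas as pd

# Assumed mapping; verify against the competition's data description.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Toy label column; the values are made up for illustration.
toy_labels = pd.Series([0, 2, 1, 0, 0, 2], name="label")

# Replace each numeric label with its name, then count occurrences.
named_counts = toy_labels.map(LABEL_NAMES).value_counts()
print(named_counts)
```

With the real data, `train_df['label'].map(LABEL_NAMES).value_counts()` could be passed to `drawGraph` in the same way as the numeric counts.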

In [None]:
def drawHistogram(records, col, title):
    fig, ax = plt.subplots(figsize=(15,10))
    plt.hist(records, 10, density=False, color='crimson', log=True)
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title(title)
    plt.grid(True)
    plt.show()

# Word Count Histogram
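How many words does each record contain? The word count is approximated by counting spaces and adding one, and the frequency axis is logarithmic.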

In [None]:
drawHistogram(train_df['premise'].str.count(' ') + 1, "Word Count", "Word Count Frequency for 'Premise' Training Data")

In [None]:
drawHistogram(train_df['hypothesis'].str.count(' ') + 1, "Word Count", "Word Count Frequency for 'Hypothesis' Training Data")

In [None]:
drawHistogram(test_df['premise'].str.count(' ') + 1, "Word Count", "Word Count Frequency for 'Premise' Test Data")

In [None]:
drawHistogram(test_df['hypothesis'].str.count(' ') + 1, "Word Count", "Word Count Frequency for 'Hypothesis' Test Data")
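
Counting spaces overcounts when a text has doubled or leading/trailing spaces. Splitting on whitespace is a slightly more robust approximation, sketched below on a made-up `toy_text` Series. Neither approach is meaningful for languages written without spaces between words (for example Chinese or Thai), where the whole sentence counts as one "word".

```python
import pandas as pd

# Toy texts; the values are made up for illustration.
toy_text = pd.Series([
    "one two three",
    "two  spaces here",   # doubled space
    "  padded text  ",    # leading and trailing spaces
    "这是一个句子",         # no spaces between words
])

# Approach used above: number of space characters plus one.
by_spaces = toy_text.str.count(" ") + 1

# Alternative: split on runs of whitespace and count the pieces.
by_split = toy_text.str.split().str.len()

print(pd.DataFrame({"text": toy_text, "by_spaces": by_spaces, "by_split": by_split}))
```

For the histograms, `train_df['premise'].str.split().str.len()` could replace the space-counting expression.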

# Thank you
Thank you for taking a look at my EDA for Contradictory, My Dear Watson.  I would love feedback.  Good luck on the competition!