# Exploratory Data Analysis focused on text content

In this kernel I will explore the text data available for the PetFinder.com competition.

The objective is to extract insights from the available text content.

## Text data source

Text content is available in:
* **Description** feature (train.csv): a short description of the pet.


## Definitions

The following terms are used in this kernel:
* Corpus: a corpus (plural: corpora), or text corpus, is a large, structured set of texts.
* n-gram: a contiguous sequence of n words from a text, treated as a single unit.
* Unigram: a 1-gram (e.g. “cuteness”).
* Bigram: a 2-gram (e.g. “Guard dog”).
* Trigram: a 3-gram (e.g. “Domestic Short Hair”).
* Term Frequency (TF): how often a given word appears within a document.
* Inverse Document Frequency (IDF): a factor that downweights words that appear in many documents.
* TF-IDF: short for term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
* Stop words: very common, short function words such as "the", "is", "at", "which", and "on". They are usually removed from the text input.
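
To make the n-gram and TF-IDF definitions above concrete, here is a minimal pure-Python sketch. The toy corpus and the helper names (`ngrams`, `tf`, `idf`, `tf_idf`) are made up for illustration, and the formula is the plain textbook one; sklearn's `TfidfVectorizer` applies a smoothed and normalized variant, so its numbers will differ.

```python
import math

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus (hypothetical, not taken from train.csv)
docs = [
    "friendly guard dog",
    "playful kitten",
    "friendly playful puppy",
]
tokenized = [d.split() for d in docs]

# Bigrams of the first document
bigrams = ngrams(tokenized[0], 2)

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "guard" appears in 1 of 3 documents, "friendly" in 2 of 3
score_guard = tf_idf("guard", tokenized[0], tokenized)
score_friendly = tf_idf("friendly", tokenized[0], tokenized)
print(bigrams, score_guard, score_friendly)
```

Both words occur once in the first document, so they have the same term frequency; "guard" scores higher only because it is rarer across the corpus.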

## Import libraries

We use libraries such as sklearn to help with the text analysis.

In [None]:
# For TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords 

In [None]:
# For Word Count Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
print(os.listdir("../input/train"))

## About the data

In [None]:
# loading CSV data to check the Description content
train_csv = pd.read_csv("../input/train/train.csv")
train_csv.head()

### Showing 20 random description samples

In [None]:
train_csv.sample(20).Description

### Non-English text content

Some descriptions are not in English, so we will need to clean them before the analysis.

In [None]:
train_csv.loc[[6440,10779]].Description
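
One possible way to flag such records is sketched below. It is only a crude heuristic under stated assumptions: the function names and the 0.3 threshold are hypothetical, and it only catches text written in non-Latin scripts. Descriptions in other Latin-alphabet languages would pass unnoticed and would need a proper language-detection step.

```python
def non_ascii_ratio(text):
    """Fraction of characters in `text` outside the ASCII range."""
    # Missing descriptions (NaN) and empty strings count as 0.0
    if not isinstance(text, str) or not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)

def looks_non_latin(text, threshold=0.3):
    """Heuristic flag: True when most of the text is non-ASCII (threshold is an assumption)."""
    return non_ascii_ratio(text) > threshold

# Made-up samples, not taken from train.csv
samples = ["Very playful puppy, loves kids.", "可爱的小狗，等待领养"]
flags = [looks_non_latin(s) for s in samples]
print(flags)
```

On the real data this could be applied with `train_csv[train_csv.Description.apply(looks_non_latin)]` to inspect the flagged rows before deciding how to handle them.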

### Removing records without Description

In [None]:
# Count the records with a missing Description, then drop them
n_missing = train_csv.Description.isna().sum()
train_csv.dropna(subset=['Description'], inplace=True)
print(f'There were {n_missing} records without a description. They were removed!')

# Word Count

Showing the most frequent words and bigrams in the Description content, grouped by adoption speed.

In [None]:
def get_top_n_words(corpus, n=None):
    """
    List the top n terms (unigrams and bigrams) in a vocabulary by total
    occurrence count in a text corpus.
    """
    vec = CountVectorizer(
            strip_accents='unicode',
            analyzer='word',
            token_pattern=r'\w{3,}', # vectorize 3-character words or more
            stop_words='english',
            ngram_range=(1, 2),
            max_features=30000
        ).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total count of each term over all documents
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

words_count_by_adoption_speed = []
for adoption_speed in range(5):
    descriptions_by_adoption_speed = train_csv[train_csv.AdoptionSpeed == adoption_speed].Description
    top_words = get_top_n_words(descriptions_by_adoption_speed, 25)
    words_count_by_adoption_speed.append(pd.DataFrame(top_words, columns = ['Word', 'Count'])) 
    words_count_by_adoption_speed[adoption_speed].plot.bar(x='Word',y='Count',title="Top 25 words X Adoption Speed " + str(adoption_speed))
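
To see what the vectorizer settings used above do, here is a small sketch on a made-up two-sentence corpus (the sentences are hypothetical, not from train.csv). It uses the same `token_pattern`, `stop_words` and `ngram_range` values as `get_top_n_words`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration
toy_corpus = [
    "He is a cute playful puppy",
    "Cute puppy for adoption",
]

vec = CountVectorizer(
    token_pattern=r'\w{3,}',  # keep tokens with 3 or more characters
    stop_words='english',     # drop English stop words such as "for"
    ngram_range=(1, 2),       # count unigrams and bigrams
).fit(toy_corpus)

# Sum each term's count over all documents
totals = vec.transform(toy_corpus).sum(axis=0)
counts = {term: int(totals[0, idx]) for term, idx in vec.vocabulary_.items()}
print(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
```

Note that bigrams are built after stop words are removed, so "puppy adoption" is counted as a bigram even though "for" sits between the two words in the original sentence. This is worth keeping in mind when reading the bigrams in the charts.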

# to be continued...

We can explore techniques like stemming to improve this analysis.
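
As a minimal sketch of the stemming idea, the snippet below uses NLTK's `PorterStemmer` on a few made-up words. The choice of the Porter stemmer is an assumption; other stemmers or a lemmatizer could be used instead.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up examples: inflected forms that a plain word count treats as different words
words = ["adopt", "adopted", "adopting", "kittens", "kitten"]
stems = [stemmer.stem(w) for w in words]
print(list(zip(words, stems)))
```

Applying the stemmer to the tokens before counting would merge such variants into a single entry, so the top-word charts would reflect concepts rather than individual word forms.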
