# Text analysis


This notebook aims to create a simple, repeatable text analysis pipeline.



## Common analysis for pre-processing

In this phase we check whether anything needs to be filtered out of the data; the focus is on pre-processing and improving data quality. We will go through the following checks:

- check the number of columns and rows (filter columns that we don't need)
- check rows with missing data/text
- find the minimum, maximum, and average length of the input text
- presence of multi-lingual text (if focus is only on English data)
- check if we need to clean data (by checking things we don't need in our dataset moving forward)
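
The length and multi-lingual checks in the list above could be sketched as follows. This is a minimal illustration, not a definitive implementation: the `sample` frame is a toy stand-in (two tweets from this dataset plus one made-up Spanish line), `non_ascii_ratio` is a hypothetical helper, and the `0.05` threshold is an arbitrary assumption. The non-ASCII ratio is only a crude heuristic, not real language detection, and it will also flag English text that contains emoji.

```python
import pandas as pd

# Toy stand-in for the text column; the last row is made up for illustration.
sample = pd.DataFrame(
    {
        "tweet": [
            "bihday your majesty",
            "factsguide: society now #motivation",
            "¡buenos días a todos!",
        ]
    }
)

# Character length of each text, then min / max / mean over the column.
char_len = sample["tweet"].str.len()
length_summary = {
    "min": int(char_len.min()),
    "max": int(char_len.max()),
    "mean": float(char_len.mean()),
}
print(length_summary)


def non_ascii_ratio(text: str) -> float:
    """Fraction of characters outside the ASCII range (hypothetical helper)."""
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)


# Flag rows whose non-ASCII share exceeds an assumed, arbitrary threshold.
sample["non_ascii_ratio"] = sample["tweet"].map(non_ascii_ratio)
maybe_non_english = sample[sample["non_ascii_ratio"] > 0.05]
print(maybe_non_english)
```

For a real project, a dedicated language-identification library would likely be a better choice than this heuristic.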


### Size of dataset

This can usually be checked with the `pandas` package. Let's import the library and read the dataset.

Here we are using a Twitter dataset.

In [1]:
import pandas as pd

In [2]:
twitter_data = pd.read_csv("./dataset/twitter/train.csv")

In [3]:
twitter_data.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [4]:
list(twitter_data.columns)

['id', 'label', 'tweet']

In [5]:
len(twitter_data)

31962

Alternatively, check the `shape` attribute of the pandas DataFrame, which returns a tuple `(number_of_rows, number_of_columns)`.

In [6]:
twitter_data.shape

(31962, 3)

Our sample dataset has only about 32k records. A real-world dataset might have millions of rows and hundreds of columns. In that case you might want to filter out some of the columns and save the result separately to keep your DataFrame less cluttered.
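
When a file is very wide, one option is to skip the unwanted columns at read time instead of dropping them afterwards. The sketch below uses the `usecols` parameter of `pd.read_csv`; the in-memory CSV is a toy stand-in (an assumption for illustration) so the snippet runs without the real file.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a wide file; the column names mirror
# this notebook's dataset, the rows are made up.
csv_text = "id,label,tweet\n1,0,first example\n2,0,second example\n"

# usecols tells pandas to parse only the listed columns.
slim = pd.read_csv(io.StringIO(csv_text), usecols=["label", "tweet"])
print(slim.shape)
print(list(slim.columns))
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.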

Let's assume we are only interested in the `label` and `tweet` fields. We can create a new DataFrame with these two columns by running the following code.

In [7]:
main_data = twitter_data[['label', 'tweet']]
main_data.head()

| | label | tweet |
|---|---|---|
| 0 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 0 | bihday your majesty |
| 3 | 0 | #model i love u take with u all the time in ... |
| 4 | 0 | factsguide: society now #motivation |


In [8]:
main_data.to_csv("./dataset/main.csv", index=False)

### Check for missing data

This step is very important if we are pre-processing the dataset to train a model. We cannot feed missing values to a model, so we must remove any empty or missing values.

In [9]:
# Define which values count as null or empty; this can differ between datasets
na_values = [None, "null", "", " "]

In [10]:
twitter_data[(twitter_data['tweet'].isin(na_values)) | (twitter_data['label'].isin(na_values))]

| | id | label | tweet |
|---|---|---|---|
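
By default `pd.read_csv` usually loads empty fields as `NaN`, so it can be worth combining a placeholder list with `isna()` and then dropping the flagged rows. The following is a minimal sketch under that assumption; the `toy` frame and the `placeholders` list are made up for illustration and should be adapted to the dataset at hand.

```python
import numpy as np
import pandas as pd

# Made-up rows covering the cases of interest: a valid tweet, a NaN,
# a whitespace-only string, and a literal "null" placeholder.
toy = pd.DataFrame(
    {
        "label": [0, 1, 0, 1],
        "tweet": ["a normal tweet", np.nan, "   ", "null"],
    }
)

# Assumed placeholder strings; compared after stripping whitespace.
placeholders = ["null", ""]

text = toy["tweet"]
is_missing = (
    text.isna()                              # real missing values (NaN/None)
    | text.str.strip().isin(placeholders)    # blank or placeholder strings
    | toy["label"].isna()                    # missing labels
)

# Keep only the rows that passed every check.
cleaned = toy[~is_missing].reset_index(drop=True)
print(cleaned)
```

Stripping whitespace before the comparison means strings such as `" "` and `"   "` are caught without listing every variant.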


## Getting to know the data

In this section we dig deeper into the dataset to understand its underlying patterns. Things to look at:

- presence of toxic comments or text (to check if data is safe for further usage)
- check the overall sentiment of each row of text
- check the common theme of the text
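
As a rough first look at common themes, one option for tweet data is to count the most frequent hashtags. This is only a proxy for a theme, and the helper name `top_hashtags` is hypothetical; sentiment and toxicity checks typically need a dedicated model or library and are not shown here. The example tweets are taken from the `head()` output earlier in this notebook, truncated as displayed there.

```python
import re
from collections import Counter

# Tweets copied from the head() output above (truncated as displayed).
tweets = [
    "@user @user thanks for #lyft credit i can't us...",
    "#model i love u take with u all the time in ...",
    "factsguide: society now #motivation",
]

# A hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(texts, n=10):
    """Return the n most frequent hashtags (lower-cased) with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)


top = top_hashtags(tweets)
print(top)
```

On the full dataset the same function could be called with `twitter_data["tweet"]` in place of the small list.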


## EDA


With your topic of interest in mind, come up with questions about the data you are analyzing. If the dataset relates to a particular field, ask several such questions to better understand it and find meaningful insights.

For example, if you have a customer review dataset, you can identify which country or city has the most negative reviews, which helps show where the company needs to focus. Similarly, you can look for issues that customers commonly mention in their reviews so that they can be fixed.
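
The country question could be answered with a simple `groupby`. The sketch below assumes a review table with hypothetical `country` and `sentiment` columns; the rows are made-up toy data, not results from any real dataset.

```python
import pandas as pd

# Made-up toy data; `country` and `sentiment` are hypothetical column names.
reviews = pd.DataFrame(
    {
        "country": ["DE", "DE", "FR", "FR", "FR", "US"],
        "sentiment": [
            "negative", "positive",
            "negative", "negative", "positive",
            "positive",
        ],
    }
)

# Per country: number of negative reviews, share of negative reviews,
# and total number of reviews, sorted by the negative count.
negative_by_country = (
    reviews.assign(is_negative=reviews["sentiment"].eq("negative"))
    .groupby("country")["is_negative"]
    .agg(["sum", "mean", "count"])
    .rename(
        columns={
            "sum": "negative_reviews",
            "mean": "negative_share",
            "count": "total_reviews",
        }
    )
    .sort_values("negative_reviews", ascending=False)
)

worst_country = negative_by_country.index[0]
print(negative_by_country)
```

Looking at the share alongside the raw count helps avoid singling out a country only because it has more reviews overall.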


