<a href="https://colab.research.google.com/github/wizard339/education/blob/main/misis/nlp/text_classification/transfer_learning_nlp_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

from sklearn.preprocessing import LabelEncoder

%matplotlib inline

## Loading data, EDA, data preprocessing

### Loading data

In [37]:
train_raw_data = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/wizard339/education/main/misis/nlp/text_classification/train.csv', index_col=0)
final_test_raw_data = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/wizard339/education/main/misis/nlp/text_classification/test.csv')

print(f'Shape of train: {train_raw_data.shape}')
print(f'Shape of test: {final_test_raw_data.shape}')

Shape of train: (41159, 2)
Shape of test: (3798, 2)


### Working with missing data

Let's look at the data:


In [38]:
train_raw_data.head()

Unnamed: 0,Text,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [39]:
train_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41159 entries, 0 to 41156
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Text       41158 non-null  object
 1   Sentiment  41155 non-null  object
dtypes: object(2)
memory usage: 964.7+ KB


The `info()` output shows missing values: of 41159 rows, `Text` has 41158 non-null entries and `Sentiment` has 41155. Let's look at the rows with a missing `Sentiment`:

In [40]:
train_raw_data[train_raw_data['Sentiment'].isnull()]

Unnamed: 0,Text,Sentiment
33122,@PrivyCouncilCA #SocialDistancing isnÂt enoug...,NaN
NaN,Neutral,NaN
39204,@TanDhesi @foreignoffice @Afzal4Gorton @Expres...,NaN
Neutral,NaN,NaN
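
The filter above checks only `Sentiment`, while `info()` reports a missing value in `Text` as well. A minimal sketch of checking every column at once is shown below. It runs on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the dataset), so treat it as an illustration rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_raw_data (values are made up).
toy = pd.DataFrame({
    'Text': ['good news', 'Neutral', np.nan, 'prices rise'],
    'Sentiment': ['Positive', np.nan, np.nan, 'Negative'],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows where at least one column is missing.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Applied to `train_raw_data`, the same two expressions would list every incomplete row, not only those without a label.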


All four rows lack a label, and two of them look like fragments of records that were split across lines in the source file. Let's drop them and check the data again:

In [41]:
train_raw_data = train_raw_data.dropna().reset_index(drop=True)
print(f'New shape of train: {train_raw_data.shape}')

New shape of train: (41155, 2)
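
Calling `dropna()` without arguments removes a row if any column is missing. A slightly more explicit variant, sketched here on a hypothetical toy frame (`toy`, `required`, and `n_dropped` are illustrative names, not part of the notebook), names the required columns and reports how many rows were removed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_raw_data (values are made up).
toy = pd.DataFrame({
    'Text': ['good news', 'Neutral', np.nan, 'prices rise'],
    'Sentiment': ['Positive', np.nan, np.nan, 'Negative'],
})

# Columns that must be present for a row to be usable.
required = ['Text', 'Sentiment']

# Drop rows missing any required column and renumber the index from 0.
cleaned = toy.dropna(subset=required).reset_index(drop=True)

n_dropped = len(toy) - len(cleaned)
print(f'Dropped {n_dropped} of {len(toy)} rows')
```

With only two columns, both of them required, this behaves the same as a plain `dropna()`. The `subset` argument starts to matter once optional columns are added.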


In [42]:
train_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41155 entries, 0 to 41154
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Text       41155 non-null  object
 1   Sentiment  41155 non-null  object
dtypes: object(2)
memory usage: 643.2+ KB


### Label encoding of the target column

In [43]:
train_raw_data['Sentiment'].value_counts()

Positive              11422
Negative               9917
Neutral                7711
Extremely Positive     6624
Extremely Negative     5481
Name: Sentiment, dtype: int64

In [44]:
le = LabelEncoder()
# Fit the encoder and replace the string labels with integer codes in one step
train_raw_data['Sentiment'] = le.fit_transform(train_raw_data['Sentiment'])
train_raw_data.head(5)

Unnamed: 0,Text,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,3
1,advice Talk to your neighbours family to excha...,4
2,Coronavirus Australia: Woolworths to give elde...,4
3,My food stock is not the only one which is emp...,4
4,"Me, ready to go at supermarket during the #COV...",0
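
To see which integer stands for which class, the fitted encoder can be inspected. The sketch below refits a separate encoder on the five class names from the `value_counts()` output (`encoder` and `label_to_code` are illustrative names; the notebook's own encoder is `le`). `LabelEncoder` sorts the classes, so for strings the codes follow alphabetical order rather than sentiment strength.

```python
from sklearn.preprocessing import LabelEncoder

# The five class names from the value_counts() output above.
labels = ['Positive', 'Negative', 'Neutral',
          'Extremely Positive', 'Extremely Negative']

encoder = LabelEncoder()  # illustrative name; the notebook calls its encoder `le`
encoder.fit(labels)

# classes_ is sorted, so each label's code is its position in that array.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform([3, 4, 0])
print(list(decoded))
```

Because the codes are alphabetical (`Extremely Negative` is 0, `Positive` is 4), they work as class identifiers for a classifier but should not be read as an ordinal scale.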


### Cleaning the text of useless data

Let's look at the text more closely:

In [51]:
train_raw_data['Text'].sample(10).values

array(['Here are three ways COVID-19 is killing consumer Christianity. \r\r\n\r\r\nhttps://t.co/kJEpIIiRZi',
       'Stood Up John buys toilet paper...  Just a little humor to help us get through these difficult times. Stay Safe! #StoodUpJohn  #ToiletPaper #humor #laugh #smile #comedy #joke #coffee #comic  #quarantine #CoronaVirus #virus #COVID19 #pandemic #toiletpaper #ToiletPaperPanic https://t.co/ZZKXpXSNuH',
       'Oh good. So prices will rise. #Covid_19',
       'We are receiving a high volume of phone calls and ask callers to consider if their enquiry is essential, as our staff are working remotely and/or supporting preparation of COVID-19 consumer guidance as volunteers. Essential stories relate to the virus and public/employee health. https://t.co/M3T7Rg341X',
       'How are rising food prices helping the Cdns',
       'As COVID-19 has drastically reduced the volume of automobile traffic, the May Ethanol price has dropped 30%.  Meanwhile, the May Corn price has only dropped 7