# Text Classification I (week 4)

This lab is based on the article "A Comprehensive Guide to Understand and Implement Text Classification in Python": https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

Load the libraries for dataset preparation, feature engineering, and model training.

In [1]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

!pip3 install TextBlob
!pip3 install pandas
import pandas, numpy, textblob, string  
import nltk

# load functions from textpreprocess.py
from textpreprocess import denoise_text, normalize, replace_contractions, remove_non_ascii, to_lowercase, remove_punctuation, replace_numbers, remove_stopwords



## 1. Dataset preparation
We use a dataset of Amazon reviews, which can be downloaded here: https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235 The dataset consists of 10,000 text reviews and their labels. To prepare it, load the downloaded data into a pandas dataframe with two columns: text and label. Label 1 (`__label__1`) marks a negative review and label 2 (`__label__2`) a positive one.

In [2]:
# load the dataset
with open('data/corpus', encoding="utf-8") as f: # close the file once it has been read
    data = f.read()
labels, texts = [], []

# preprocess the data using functions available in textpreprocess.py
for i, line in enumerate(data.split("\n")):
    line = replace_contractions(line) # Replace contractions in string of text
    content = nltk.word_tokenize(line)
    labels.append(content[0]) # add first item in content as label
    words = content[1:] # add words as 2nd item onwards
    words = remove_non_ascii(words)
    #words = to_lowercase(words)
    words = remove_punctuation(words)
    #words = replace_numbers(words)
    #words = remove_stopwords(words)
    texts.append(words) # add words to text List

# create a dataframe (table format) using texts and labels
trainDF = pandas.DataFrame()
texts1=[' '.join(line) for line in texts] # join words in each line with space character
trainDF['text'] = texts1 # add text into dataframe ("text column")
trainDF['label'] = labels # add label into dataframe ("labels column")
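
The loop above depends on each line of the corpus starting with a `__label__N` token followed by the review text. As a minimal sketch of that line format only, the block below separates label and text with plain string operations, without `nltk` or the helpers in `textpreprocess.py`. The sample lines and the helper name `parse_line` are hypothetical, and no contraction, non-ASCII, or punctuation cleaning is performed.

```python
# Minimal sketch: split '__label__N review text' lines into label and text.
# The sample lines and parse_line are hypothetical, for illustration only.

def parse_line(line):
    """Return (label, text) for a line of the form '__label__N some text'."""
    label, _, text = line.strip().partition(" ")
    return label, text

sample = (
    "__label__2 Great soundtrack: I would recommend it to anyone\n"
    "__label__1 do not buy: The box looked used\n"
    "\n"  # blank lines are skipped below
)

rows = [parse_line(line) for line in sample.split("\n") if line.strip()]
labels = [label for label, _ in rows]
texts = [text for _, text in rows]

# Map the raw labels to readable names (1 = negative, 2 = positive).
sentiment = {"__label__1": "negative", "__label__2": "positive"}
for label, text in rows:
    print(sentiment[label], "|", text)
```

The resulting `rows` list can be passed to `pandas.DataFrame(rows, columns=['label', 'text'])` to obtain a two-column table like `trainDF`.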

In [3]:
texts

[['Stuning',
  'even',
  'for',
  'the',
  'nongamer',
  'This',
  'sound',
  'track',
  'was',
  'beautiful',
  'It',
  'paints',
  'the',
  'senery',
  'in',
  'your',
  'mind',
  'so',
  'well',
  'I',
  'would',
  'recomend',
  'it',
  'even',
  'to',
  'people',
  'who',
  'hate',
  'vid',
  'game',
  'music',
  'I',
  'have',
  'played',
  'the',
  'game',
  'Chrono',
  'Cross',
  'but',
  'out',
  'of',
  'all',
  'of',
  'the',
  'games',
  'I',
  'have',
  'ever',
  'played',
  'it',
  'has',
  'the',
  'best',
  'music',
  'It',
  'backs',
  'away',
  'from',
  'crude',
  'keyboarding',
  'and',
  'takes',
  'a',
  'fresher',
  'step',
  'with',
  'grate',
  'guitars',
  'and',
  'soulful',
  'orchestras',
  'It',
  'would',
  'impress',
  'anyone',
  'who',
  'cares',
  'to',
  'listen',
  '_'],
 ['The',
  'best',
  'soundtrack',
  'ever',
  'to',
  'anything',
  'I',
  'am',
  'reading',
  'a',
  'lot',
  'of',
  'reviews',
  'saying',
  'that',
  'this',
  'is',
  'the',
 

In [4]:
texts1

['Stuning even for the nongamer This sound track was beautiful It paints the senery in your mind so well I would recomend it even to people who hate vid game music I have played the game Chrono Cross but out of all of the games I have ever played it has the best music It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras It would impress anyone who cares to listen _',
 'The best soundtrack ever to anything I am reading a lot of reviews saying that this is the best game soundtrack and I figured that I would write a review to disagree a bit This in my opinino is Yasunori Mitsuda s ultimate masterpiece The music is timeless and I am been listening to it for years now and its beauty simply refuses to fadeThe price tag on this is pretty staggering I must say but if you are going to buy any cd for this much money this is the only one that I feel would be worth every penny',
 'Amazing This soundtrack is my favorite music of all time hands down

In [5]:
trainDF.head() # return first 5 rows

| | text | label |
|---|---|---|
| 0 | Stuning even for the nongamer This sound track... | __label__2 |
| 1 | The best soundtrack ever to anything I am read... | __label__2 |
| 2 | Amazing This soundtrack is my favorite music o... | __label__2 |
| 3 | Excellent Soundtrack I truly like this soundtr... | __label__2 |
| 4 | Remember Pull Your Jaw Off The Floor After Hea... | __label__2 |


In [6]:
trainDF.tail() # return last 5 rows

| | text | label |
|---|---|---|
| 9995 | A revelation of life in small town America in ... | __label__2 |
| 9996 | Great biography of a very interesting journali... | __label__2 |
| 9997 | Interesting Subject Poor Presentation you woul... | __label__1 |
| 9998 | do not buy The box looked used and it is obvio... | __label__1 |
| 9999 | Beautiful Pen and Fast Delivery The pen was sh... | __label__2 |


In [7]:
trainDF.shape # 10,000 rows and 2 columns

(10000, 2)

In [8]:
trainDF['label'].value_counts() # count how many reviews carry each label

__label__1    5097
__label__2    4903
Name: label, dtype: int64
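
The two classes are close to balanced. As a quick sanity check, the sketch below turns the counts shown above into proportions and computes the accuracy of a hypothetical classifier that always predicts the most frequent label. This majority baseline is an illustrative addition; it assumes only the two counts printed by `value_counts()`.

```python
from collections import Counter

# Counts copied from the value_counts() output above.
counts = Counter({"__label__1": 5097, "__label__2": 4903})
total = sum(counts.values())

# Share of each label in the dataset.
for label, n in counts.most_common():
    print(f"{label}: {n} reviews ({n / total:.2%})")

# Accuracy of always predicting the most frequent label.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print(f"majority baseline ({majority_label}): {baseline_accuracy:.4f}")
```

Any classifier trained on this data should score noticeably above this baseline to be considered useful.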

Next, we split the dataset into training and validation sets so that we can train and test a classifier. We also encode the target column so that it can be used in machine learning models.