## Step 1 - Importing Libraries


This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API.
The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

Content
It contains the following 6 fields:

    1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    2. ids: the id of the tweet (2087)
    3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    4. flag: the query (lyx). If there is no query, then this value is NO_QUERY.
    5. user: the user that tweeted (robotickilldozr)
    6. text: the text of the tweet (Lyx is cool)

In [3]:
# importing libraries
# data manipulation
import pandas as pd
import numpy as np
import re
import string

# text-processing methods and stopwords
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split

# Machine Learning Libraries
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import warnings
warnings.filterwarnings("ignore")

## English stopwords

In [5]:
# Creating a stopwords set
import nltk
nltk.download('stopwords')
stop_words=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Step 2 - Load Datasets

In [6]:
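def load_dataset(filepath, cols):
    '''
    Reads the CSV file and returns
    a dataframe with the specified column names.
    '''
    # The Sentiment140 CSV has no header row, so pass header=None;
    # otherwise the first tweet is consumed as column labels and lost.
    df = pd.read_csv(filepath, encoding='latin-1', header=None, names=cols)
    return df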
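
As a rough illustration of why the header handling matters when loading this file, the sketch below reads a tiny in-memory CSV both ways. The two sample rows and the names `COLS` and `SAMPLE_CSV` are made up for the example, and it assumes the file has no header line, which is worth checking on your own copy.

```python
import io

import pandas as pd

# Column names follow the six fields described earlier.
COLS = ["target", "ids", "date", "flag", "user", "text"]

# Two made-up rows in the Sentiment140 layout (no header line).
SAMPLE_CSV = (
    '"0","1","Sat May 16 23:58:44 UTC 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Sat May 16 23:59:01 UTC 2009","NO_QUERY","user_b","what a great day"\n'
)

# Default header inference treats the first tweet as column labels,
# so renaming the columns afterwards silently drops that row.
inferred = pd.read_csv(io.StringIO(SAMPLE_CSV))
inferred.columns = COLS

# header=None together with names=... keeps every row as data.
explicit = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLS)

print(len(inferred), len(explicit))  # prints: 1 2
```

On the full dataset the difference is a single row out of 1,600,000, so it is easy to miss, but the lost row is a labelled tweet.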

## Getting rid of unwanted columns

In [9]:
def delete_redundant_cols(df,cols):
    '''
    Delete unwanted columns(cols) from the dataframe.
    '''
    for col in cols:
        del df[col]
    return df
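
A small sketch of the same idea using `DataFrame.drop`, shown on a made-up two-row frame. Which columns count as unwanted is an assumption here (everything except `target` and `text`); the notebook text above does not name them.

```python
import pandas as pd

# Made-up frame with the six Sentiment140 fields.
df = pd.DataFrame({
    "target": [0, 4],
    "ids": [1, 2],
    "date": ["Sat May 16 23:58:44 UTC 2009", "Sat May 16 23:59:01 UTC 2009"],
    "flag": ["NO_QUERY", "NO_QUERY"],
    "user": ["user_a", "user_b"],
    "text": ["so tired today", "what a great day"],
})

# Assumption: only the label and the tweet text are needed downstream.
unwanted = ["ids", "date", "flag", "user"]

# drop() returns a new frame and leaves the original untouched,
# unlike `del df[col]`, which mutates the frame in place.
trimmed = df.drop(columns=unwanted)

print(list(trimmed.columns))  # prints: ['target', 'text']
```

Returning a new frame keeps the raw data available for inspection, at the cost of some extra memory on a dataset of this size.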

# Preprocessing Tasks
    
    1. Handling casing (convert all text to lower case)
    2. Noise removal (special characters, HTML tags, etc.)
    3. Tokenization
    4. Stopword removal (very common words that carry little meaning on their own, such as "the" or "is", are removed)
    5. Text normalization (stemming and lemmatization)
    
            1. Stemming strips affixes, in practice mostly suffixes, to reach the stem of a word. Because it works by rule rather than by vocabulary, it sometimes produces a stem that is not a real word or that loses the original meaning.
            2. Lemmatization is usually the better choice here: it reduces an inflected word to its dictionary form (lemma) using vocabulary and morphological analysis.
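
The five tasks above can be sketched as one function. This is a minimal illustration, not the notebook's actual implementation: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, tokenization is a plain whitespace split rather than `word_tokenize`, and only stemming is shown because `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.

```python
import re
import string

from nltk.stem import PorterStemmer

# Tiny illustrative stopword set; a stand-in for NLTK's English list.
DEMO_STOPWORDS = {"i", "am", "the", "a", "is", "so", "and", "to", "my"}

stemmer = PorterStemmer()


def preprocess(tweet, stop_words=DEMO_STOPWORDS):
    """Return a list of stemmed tokens for one tweet."""
    # 1. Casing: lower-case everything.
    text = tweet.lower()
    # 2. Noise removal: HTML tags, URLs, @mentions, then punctuation.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenization: simple whitespace split.
    tokens = text.split()
    # 4. Stopword removal.
    tokens = [tok for tok in tokens if tok not in stop_words]
    # 5. Normalization: Porter stemming.
    return [stemmer.stem(tok) for tok in tokens]


sample = "I am SO happy!!! Loving the new <b>phone</b> @bob http://t.co/x"
print(preprocess(sample))
```

The Porter stemmer turns "happy" into "happi", which shows how a stem need not be a dictionary word; a lemmatizer would return "happy" unchanged.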