<a href="https://colab.research.google.com/github/roiantman/roiantman/blob/main/Sentinement_Analysis_logistic_nlp_techtrain_todo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using logistic regression

In this notebook, you will implement a logistic regression model for sentiment analysis on tweets. You will classify each tweet by its sentiment: positive, neutral, or negative.


## Text pre-processing methods

### Tokenization
As we saw in the presentation, tokenization is the process of breaking a given text down into its smallest units, called tokens.
Let's look at some examples using NLTK.


In [None]:
import nltk
nltk.download('punkt')

text = "This is an example for our tokenizer. this is great !"
sent_tokens=nltk.sent_tokenize(text)
words_tokens = nltk.word_tokenize(text)
print("sent_tokens:",sent_tokens)
print("words_tokens:",words_tokens)
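
To see why a dedicated tokenizer is useful, here is a minimal sketch that contrasts a plain whitespace split with a simple regular-expression tokenizer. The pattern is an illustrative assumption and is not what NLTK uses internally.

```python
import re

text = "This is an example for our tokenizer. this is great !"

# Splitting on whitespace leaves punctuation glued to words ("tokenizer.")
whitespace_tokens = text.split()

# A simple pattern: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The regular-expression version separates the period from "tokenizer", which is closer to what `nltk.word_tokenize` aims for.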

### StopWords 
During this step, we remove words that do not carry much information. Examples of such words: "the", "and", "it". These words are common in every corpus and tell us little about the sentiment of the text.

In most cases, we can use NLTK's stop words list, `stopwords.words()`, but we might want to add words to this list depending on the context of our problem.



In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stopwords.words("english")[:20]
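
As a minimal sketch of adding context-specific words to the stop list, the example below uses a tiny stand-in set instead of the full NLTK list so it runs without any download. The extra words `rt` and `via` are hypothetical additions for a Twitter corpus.

```python
# A tiny stand-in stop list; in the notebook you would start from
# set(stopwords.words("english")) instead
base_stopwords = {"the", "and", "it", "is", "a"}

# Hypothetical domain-specific additions for a Twitter corpus
custom_stopwords = base_stopwords | {"rt", "via"}

tokens = ["rt", "the", "game", "is", "amazing", "and", "fun"]
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)  # ['game', 'amazing', 'fun']
```

Using a set rather than a list keeps each membership check fast, which matters when filtering thousands of tweets.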

Sometimes we also want to remove punctuation. We can use `string.punctuation` to get a string containing all punctuation characters.

In [None]:
import string
string.punctuation
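
One possible way to use `string.punctuation` for filtering is sketched below. The token list is made up for illustration.

```python
import string

tokens = ["great", "!", "game", ",", ":)", "don't"]

# Drop tokens that consist only of punctuation characters
no_punct = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
print(no_punct)  # ['great', 'game', "don't"]
```

Note that this simple filter also drops the emoticon `:)`, which may carry sentiment. Whether to keep emoticons is a design choice for your own pipeline.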

### Stemming 
Stemming is the process of reducing a word to its root form (its stem). To apply stemming, we cut off suffixes according to a set of rules.

Rule-based stemming has two potential problems: overstemming and understemming.

* **Overstemming** occurs when words are over-truncated. In such cases, the meaning of the word may be distorted or lost, and unrelated words may end up with the same stem.

* **Understemming** occurs when two words that share the same root are reduced to different stems.

To apply stemming we will use the Porter stemmer. If you are interested, you can read more about it here: [Porter Stemming algorithm](https://vijinimallawaarachchi.com/2017/05/09/porter-stemming-algorithm/)

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
word = "tokenization"
ps.stem(word)
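
The two problems described above can be observed with the Porter stemmer itself. This is a small sketch; the word groups are commonly used illustrations and were chosen by hand.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Overstemming: words with different meanings collapse to one stem
over = [ps.stem(w) for w in ["universal", "university", "universe"]]

# Understemming: related words end up with different stems
under = [ps.stem(w) for w in ["alumnus", "alumni"]]

print(over)
print(under)
```

If all items in `over` are identical and the two items in `under` differ, you are seeing overstemming and understemming respectively.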

## Data Exploration
We will use the “Twitter Sentiment Analysis” dataset from Kaggle.
It is a collection of approximately 74,000 tweets. We will use a sample of the dataset to save time in the course.

In [None]:
from google.colab import files
upload = files.upload()

Let's see an example of how our data looks.

In [None]:
import pandas as pd
train_twitters=pd.read_csv("/content/twitter_training.csv")
train_twitters.rename(columns = {'Tweet Content':'Tweet'}, inplace = True)

train_twitters.head(10)

We are interested only in the tweet content and its sentiment.

In [None]:
train_twitters=train_twitters[["Sentiment","Tweet"]]

In [None]:
train_twitters.Sentiment.value_counts()

In [None]:
# remove duplicates
train_twitters=train_twitters.drop_duplicates().reset_index(drop=True)
train_twitters

In [None]:
train_twitters=train_twitters[train_twitters["Sentiment"]!="Irrelevant"].reset_index(drop=True)
train_twitters

In addition, since this is a supervised learning task, we map each "Positive" sentiment to 1, "Neutral" to 0, and "Negative" to -1.

In [None]:
train_twitters['Sentiment']=train_twitters['Sentiment'].map({"Positive":1,"Neutral":0,"Negative":-1})
train_twitters
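
A small sketch of how `Series.map` behaves with a dictionary may help explain the cleanup steps around it. Any label that is missing from the mapping becomes `NaN`, which is one reason to filter unwanted labels first and to check for `NaN` values afterwards. The toy series below is made up for illustration.

```python
import pandas as pd

labels = pd.Series(["Positive", "Neutral", "Negative", "Irrelevant"])
mapping = {"Positive": 1, "Neutral": 0, "Negative": -1}

# Labels that are not keys of the mapping become NaN
mapped = labels.map(mapping)
print(mapped)
```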

In [None]:
# check if there is NaN value 
train_twitters.isna().any()

In [None]:
# drop Nan
train_twitters=train_twitters.dropna()

In [None]:
train_twitters

In [None]:
# we sample to get a subset of data , for saving time so we can learn more stuff ! :)
train_twitters = train_twitters.sample(10000,random_state=0)

In [None]:
train_twitters=train_twitters.reset_index(drop=True)
train_twitters

In [None]:
import seaborn as sns
sns.countplot(x= 'Sentiment',data = train_twitters)

The data is fairly balanced: we have about the same number of "Positive" and "Negative" samples.

#### Removing hyperlinks, Twitter marks, and styles
Tweets contain many substrings, such as hashtags, retweet marks, and hyperlinks, that do not contribute to sentiment analysis. We remove these substrings with Python's `re` (regular expression) library.

In [None]:
import re

def remove_hyperlinks(tweet):
    tweet = re.sub(r'http\S+', '', tweet)  # remove http and https links
    tweet = re.sub(r'pic\.twitter\.com/\S+', '', tweet)  # remove picture links
    tweet = re.sub(r'twitter\.com/\S+', '', tweet)  # remove remaining twitter links
    tweet = re.sub(r'bit\.ly/\S+', '', tweet)  # remove bitly links
    # str.strip('[link]') would strip the characters [, l, i, n, k, ] from both
    # ends of the tweet, so use replace to remove the literal "[link]" marker
    tweet = tweet.replace('[link]', '')
    return tweet

In [None]:
import re
def clean_twitter_styles(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet=remove_hyperlinks(tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    return tweet
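
To make the effect of each substitution visible, here is a step-by-step sketch on a made-up tweet. It repeats the main patterns used above in a standalone form so it can be run on its own.

```python
import re

# A made-up tweet used only for illustration
sample = "RT @user: $GE is up today! #stocks https://t.co/abc123"

no_tickers = re.sub(r'\$\w*', '', sample)         # drop "$GE"
no_retweet = re.sub(r'^RT[\s]+', '', no_tickers)  # drop the leading "RT "
no_links = re.sub(r'http\S+', '', no_retweet)     # drop the URL
cleaned = re.sub(r'#', '', no_links).strip()      # keep "stocks", drop "#"

print(cleaned)
```

The hashtag word itself is kept because it often carries sentiment; only the `#` sign is removed.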

In [None]:
from nltk import TweetTokenizer

def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of clean, processed words from the tweet
    """
    tweet=clean_twitter_styles(tweet)
    
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    
    # TODO - tokenize the tweet - you can use NLTK's special tokenizer for tweets, TweetTokenizer
    tweet_tokens = ...

    # use a name that does not shadow the function itself
    tweets_clean = []
    # TODO - for each word in tweet_tokens, remove stopwords and punctuation
    # for removing punctuation you can use string.punctuation

    # TODO - now apply stemming to each remaining word

    return tweets_clean
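
As a hint for the tokenization step, the sketch below shows how `TweetTokenizer` behaves on a made-up tweet. The option values are one reasonable choice, not the required solution.

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases words, strip_handles=True removes @mentions,
# reduce_len=True shortens long runs of a repeated character
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

sample = "@user I loooooove this game :) so much!!!"
tokens = tokenizer.tokenize(sample)
print(tokens)
```

Unlike a general-purpose tokenizer, `TweetTokenizer` keeps emoticons such as `:)` as single tokens, which can be useful for sentiment analysis.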


Now we want to apply our function to the Tweet column to get the relevant tokens for each tweet.

This may take a few minutes


In [None]:
train_twitters['Tweet_tokens']=train_twitters["Tweet"].apply(process_tweet)

In [None]:
train_twitters

## Build our model 
We will use a simple logistic regression model.
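
As a brief reminder of the idea, binary logistic regression computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability. The sketch below illustrates this with hypothetical weights and features. scikit-learn learns the weights from data and extends the same idea to more than two classes.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three features, plus a bias term
weights = [1.5, -2.0, 0.5]
bias = 0.1
features = [1.0, 0.0, 2.0]

# Weighted sum of the features, then squashed into a probability
z = sum(w * x for w, x in zip(weights, features)) + bias
probability = sigmoid(z)
print(round(probability, 3))
```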

### Split data
We will use 75% of the data for training and 25% for validation.

In [None]:
from sklearn.model_selection import train_test_split
X = train_twitters['Tweet'] 
y = train_twitters['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25,
                                                    random_state = 0)
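
A toy sketch of what `test_size=0.25` means in practice is shown below. The eight demo samples are made up.

```python
from sklearn.model_selection import train_test_split

X_demo = [f"tweet {i}" for i in range(8)]
y_demo = [1, -1, 0, 1, -1, 0, 1, -1]

# 25% of 8 samples go to the held-out set, the rest are used for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(len(X_tr), len(X_te))  # 6 2
```

Fixing `random_state` makes the split reproducible between runs.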

In [None]:
X_train=X_train.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
X_test=X_test.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)

### Vectorize words
To fit our model, we need to represent our words numerically.
We will use a Bag of Words (BoW) representation, in which each tweet becomes a vector of term counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(1,2)) 
# create BOW
X_train_bow = count_vect.fit_transform(X_train) 
X_test_bow = count_vect.transform(X_test)
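
To make the representation concrete, here is a minimal sketch on a two-document toy corpus. With `ngram_range=(1, 2)` the vocabulary contains both single words and pairs of adjacent words.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["good movie", "bad movie"]
vect = CountVectorizer(ngram_range=(1, 2))
bow = vect.fit_transform(corpus)

# Vocabulary terms listed in column order
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(terms)
print(bow.toarray())
```

Each row of the matrix is one document and each column counts one term. Note that the vectorizer is fitted only on the training data and then reused on the test data, so both share the same columns.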

In [None]:
from sklearn import preprocessing
# normalize bow for better performance 
X_train_bow = preprocessing.normalize(X_train_bow)
X_test_bow = preprocessing.normalize(X_test_bow)
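
By default, `preprocessing.normalize` scales each row to unit L2 length, so long and short tweets end up on a comparable scale. A small sketch with made-up counts:

```python
from sklearn import preprocessing

counts = [[3.0, 4.0], [1.0, 0.0]]

# Default settings: L2 norm, applied to each row independently
normalized = preprocessing.normalize(counts)
print(normalized)
```

The first row has length 5, so it becomes `[0.6, 0.8]`. The second row already has length 1 and is unchanged.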

### Choose model
We will use a simple `LogisticRegression` model.
The solver and the maximum number of iterations are hyperparameters. You can change them to fine-tune the model.


In [None]:
from sklearn.linear_model import LogisticRegression
model= LogisticRegression(solver='lbfgs', max_iter=1000)

In [None]:
# To train our model we use model.fit(features,label)
model.fit(X_train_bow,y_train)

## Evaluate our model

In [None]:
pred_train = model.predict(X_train_bow)
pred_test = model.predict(X_test_bow)

In [None]:
from sklearn.metrics import confusion_matrix
cm_bow = confusion_matrix(y_test, pred_test)
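
To help read the confusion matrix, here is a small sketch with made-up labels. In scikit-learn, rows correspond to the actual classes and columns to the predicted classes, both in sorted label order (here -1, 0, 1).

```python
from sklearn.metrics import confusion_matrix

actual = [1, 0, -1, 1]
predicted = [1, 0, 1, 1]

cm = confusion_matrix(actual, predicted)
print(cm)
```

The single negative tweet was predicted as positive, so it appears in the first row (actual -1) and the last column (predicted 1). Correct predictions lie on the diagonal.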

In [None]:
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt # plotting

fig, ax = plot_confusion_matrix(conf_mat=cm_bow, figsize=(6, 6))
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
from sklearn.metrics import accuracy_score
accuracy_train=accuracy_score(y_train, pred_train)
accuracy_train

In [None]:
# accuracy on the held-out data, stored under its own name
accuracy_test = accuracy_score(y_test, pred_test)
accuracy_test