<p style="font-weight: bold;">We will try to understand the sentiment of tweets about the company Apple. Twitter data can give us a better picture of how the public perceives the company.

Our challenge is to see whether we can correctly classify tweets as either positive or negative.

Problem Statement:
•	Correctly classify the tweets as being positive or negative.</p>

# Using: nltk.NaiveBayesClassifier

## Load the libraries

In [1]:
## Importing the necessary libraries along with the standard import

import numpy as np 
import pandas as pd 
import re # this is the regular expression library which helps us search for or extract matching patterns from a given string
import nltk # this is the Natural Language Tool Kit which contains a lot of functionalities for text analytics
import matplotlib.pyplot as plt
import string # this is used for string manipulations
import warnings
warnings.filterwarnings('ignore')

Load the csv file available in the working or specified directory

## Load the Dataset

In [2]:
## Loading the dataset

Apple_tweets = pd.read_csv("Apple_tweets.csv")

In [None]:
## Checking the first 5 rows of the dataset

Apple_tweets.head()

## EDA & Clean-up

### Drop characters that are neither alphanumeric nor whitespace

In [None]:
Apple_tweets['Tweet'] = Apple_tweets['Tweet'].str.replace(r'[^\w\s]', '', regex=True) # regex=True is needed in pandas 2.0+, where the default is a literal match
# \w: matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _)
# \s: matches any whitespace character
# [^...]: matches any single character that is NOT listed inside the brackets
Apple_tweets
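To see what the pattern `[^\w\s]` actually removes, here is a minimal sketch using `re.sub` on a single string. The tweet text is made up for illustration and is not taken from the dataset.

```python
import re

# Made-up tweet, used only to illustrate the pattern.
tweet = "I love my #iPhone, it's great!!"

# [^\w\s] matches every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", tweet)
print(cleaned)
```

Note that `#`, `@` and apostrophes are removed as well, so hashtags and mentions become plain words after this step.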

### Convert to lower case

In [None]:
## Converting all the words to lower case

Apple_tweets['Tweet'] = Apple_tweets['Tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
Apple_tweets

### Convert Target Variable to Categorical

In [None]:
## We are defining a function to convert the 'Avg' column into a column with two classes which will be treated as the target variable later

def get_senti(x): 
    if x >= 0: 
        return "Positive" 
    else: 
        return "Negative" 

 # you can also use np.where() to get the same task done
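As the comment above suggests, `np.where()` can do the same conversion without a helper function. The sketch below assumes a toy data frame with made-up 'Avg' scores; only the column name comes from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Avg' column; the scores are made up for illustration.
toy = pd.DataFrame({"Avg": [-1.5, 0.0, 0.4, -0.2]})

# Vectorized equivalent of get_senti: scores >= 0 become "Positive".
toy["Sentiment"] = np.where(toy["Avg"] >= 0, "Positive", "Negative")
print(toy)
```

Because `np.where()` works on the whole column at once, it avoids calling a Python function for every row.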

In [None]:
# Applying the defined function on the column 'Avg' and creating a new column called 'Sentiment'
Apple_tweets["Sentiment"] = Apple_tweets["Avg"].apply(get_senti)

# Dropping the 'Avg' column from the data frame
Apple_tweets.drop("Avg",axis=1,inplace=True)

In [None]:
Apple_tweets

### Randomize the rows

We see that in the newly created 'Sentiment' variable all the positive entries come one after the other, followed by all the negative entries. Since we will split the data into training and test sets by row position, we first have to shuffle the data set. We will use the DataFrame.sample() function.

###### Note: We are not using the train-test split function from sklearn and hence the need to jumble the data set.

In [None]:
Apple_tweets=Apple_tweets.sample(frac=1,random_state=3).reset_index(drop=True)
# DataFrame.sample(): returns a random sample of rows from the data frame.
# frac=1: samples 100% of the rows without replacement, i.e. shuffles the whole dataset.
# random_state=3: fixes the seed so that the shuffle is reproducible.
# reset_index(): renumbers the index, which is out of order after shuffling.
# drop=True: discards the old index instead of keeping it as a new column.
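The shuffling step can be checked on a small example. This is a sketch on a hypothetical four-row data frame (the tweet texts `t1` to `t4` are placeholders), showing that the shuffle keeps every row, is reproducible for a fixed `random_state`, and ends with a clean 0..n-1 index.

```python
import pandas as pd

# Hypothetical mini dataset with the labels grouped together, as in the notebook.
toy = pd.DataFrame({
    "Tweet": ["t1", "t2", "t3", "t4"],
    "Sentiment": ["Positive", "Positive", "Negative", "Negative"],
})

# Shuffle all rows and renumber the index.
shuffled = toy.sample(frac=1, random_state=3).reset_index(drop=True)

# The same seed gives the same order again.
again = toy.sample(frac=1, random_state=3).reset_index(drop=True)

print(shuffled)
print(shuffled.equals(again))
```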

In [None]:
Apple_tweets.head()

### Collect words from all the tweets

In [None]:
all_Words = (' '.join(Apple_tweets['Tweet'])).split() # join all tweets with spaces into one long string, then split it into individual words

In [None]:
# all_Words = [x for x in (' '.join(Apple_tweets['Tweet']).split())]

### Check Frequency of Words

In [None]:
nltk.FreqDist(all_Words).most_common(10)

### Remove Punctuations & Stop Words

In [None]:
string.punctuation

In [None]:
# Defining a variable 'stopwords' which contains the English stopwords from nltk
# plus the punctuation characters from the string library
# (if the stopword list is missing, run nltk.download('stopwords') once)
stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)

# Keeping only the words which are not in 'stopwords'
all_words_clean = [word for word in all_Words if word not in stopwords]


# Creating a frequency distribution of the lower-case words with the stopwords removed
all_words_freq = nltk.FreqDist(all_words_clean)

### Check Frequency of Cleaned List

In [None]:
all_words_freq

In [None]:
len(all_words_freq)

### Take top 2000 words as Features

In [None]:
# Extracting the 2000 most common words, after the words have been converted to lowercase and the stopwords have been removed
word_features = [item[0] for item in all_words_freq.most_common(2000)]
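The counting and top-N selection can be sketched with the standard library alone. `nltk.FreqDist` is a subclass of `collections.Counter`, so `most_common()` behaves the same way. The token list and the two-word stop list below are made up for illustration.

```python
from collections import Counter
import string

# Made-up tokens standing in for all_Words.
tokens = "iphone iphone iphone battery battery great the the is !".split()

# Hypothetical mini stop list plus punctuation, mirroring the 'stopwords' variable.
toy_stop = {"the", "is"} | set(string.punctuation)

# Drop stop words, count what remains, and keep the 2 most frequent words.
clean = [word for word in tokens if word not in toy_stop]
freq = Counter(clean)
top_words = [word for word, count in freq.most_common(2)]
print(top_words)
```

In the notebook the cut-off is 2000 instead of 2, which caps the number of features the classifier has to handle.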

In [None]:
word_features

### Function to Check presence of each Feature in each tweet

In [None]:
## We are defining a function that turns a tokenized tweet into a dictionary of word-presence features

def document_features(document): 
    document_words = set(document) # unique tokens in the tweet; a set makes the membership test below fast
    features = {} # empty dictionary to hold the features
    for word in word_features: # loop over the 'word_features' defined in the previous code block
        features[f'contains({word})'] = (word in document_words) # True if this feature word occurs in the tweet, False otherwise
    return features

In [None]:
document_features(['apple','iphone'])
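To make the output format concrete, here is a self-contained sketch of the same idea with a three-word vocabulary. The function name `toy_document_features` and the vocabulary are hypothetical; the real function uses the 2000-word `word_features` list.

```python
# Hypothetical three-word vocabulary standing in for word_features.
toy_word_features = ["iphone", "battery", "love"]

def toy_document_features(document, vocabulary):
    """Return one True/False entry per vocabulary word for a tokenized tweet."""
    document_words = set(document)
    return {f"contains({word})": (word in document_words) for word in vocabulary}

feats = toy_document_features(["love", "my", "iphone"], toy_word_features)
print(feats)
```

Words in the tweet that are not in the vocabulary (here `my`) are simply ignored, so every tweet yields a dictionary with the same keys.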

### Tokenize the list of words in each tweet

In [None]:
frame = Apple_tweets.copy() # work on a copy so that Apple_tweets stays unchanged
frame.columns = ["feature", "label"] # rename the two columns ('Tweet', 'Sentiment') of the data frame 'frame'

# Tokenize each tweet: the 'feature' column now holds a list of words per tweet
# (word_tokenize needs the 'punkt' tokenizer data, or 'punkt_tab' in recent nltk releases; download it once with nltk.download() if it is missing)
frame['feature'] = frame['feature'].apply(lambda tweet: nltk.word_tokenize(str(tweet)))
# The 'label' column already holds the sentiment of each tweet, so it needs no further processing

In [None]:
# compare the contents of Apple_tweets df and frame df
Apple_tweets

In [None]:
frame

In [None]:
frame['feature'][88]

### Create a Feature set for each tweet

For each tweet we record the presence or absence of each of the 2000 feature words, together with the tweet's target variable value.

In [None]:
## We are now creating the combined feature set, which we will split into training and test sets before fitting a classifier

# featuresets is a list of tuples: (feature dictionary of the pre-processed tweet, corresponding sentiment label)
featuresets = [(document_features(feature), label) for index, (feature, label) in frame.iterrows()]

In [None]:
featuresets[0] # feature set of the first tweet, shown as a sample. This is a tuple of 2 elements: the first is a dictionary with the 2000 feature words as keys and True/False values depending on whether the word is found in the tweet; the second is the target variable value (Positive / Negative)

### Train an nltk.NaiveBayesClassifier

In [None]:
# Train Naive Bayes classifier

# first 70% of tweets taken in training set and remaining in test set
# remember we have already randomly mixed up the tweets with respective labels

train_set, test_set = featuresets[0:int(len(featuresets)*0.7)], featuresets[int(len(featuresets)*0.7):]

classifier = nltk.NaiveBayesClassifier.train(train_set)

### Test Data Predictions Accuracy

In [None]:
print(nltk.classify.accuracy(classifier, test_set))
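The whole train-and-evaluate step can be sketched on a few hand-made feature sets. Everything in the block is hypothetical (two features, six examples) and only mirrors the `(features, label)` format and the positional 70/30 split used above; the accuracy it prints says nothing about the real dataset.

```python
import nltk

# Hypothetical, hand-made feature sets in the same (features, label) format.
toy_featuresets = [
    ({"contains(love)": True, "contains(hate)": False}, "Positive"),
    ({"contains(love)": False, "contains(hate)": True}, "Negative"),
    ({"contains(love)": True, "contains(hate)": False}, "Positive"),
    ({"contains(love)": False, "contains(hate)": True}, "Negative"),
    ({"contains(love)": True, "contains(hate)": False}, "Positive"),
    ({"contains(love)": False, "contains(hate)": True}, "Negative"),
]

# Positional 70/30 split: the first 70% of the rows train, the rest test.
cut = int(len(toy_featuresets) * 0.7)
toy_train, toy_test = toy_featuresets[:cut], toy_featuresets[cut:]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)
toy_accuracy = nltk.classify.accuracy(toy_classifier, toy_test)
print(len(toy_train), len(toy_test), toy_accuracy)
```

On the real classifier, `classifier.show_most_informative_features(10)` is a quick way to see which words drive the predictions.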

# Alternate Way: using sklearn's GaussianNB
Now, let us reload the data and look at other text mining functionalities that Python offers us and then go on to fit a classifier algorithm.

In [None]:
## Loading the dataset

Apple_tweets = pd.read_csv("Apple_tweets.csv")

## Basic Exploration in Text Mining

### Number of words

**A lambda creates a small anonymous function inline. It needs no name, unlike a function defined with def, but it returns the same output as the equivalent named function.**
**A lambda is convenient for short, one-off operations such as those passed to apply(); it is not faster or lighter on memory than a def function. There are also multiple ways to get a similar output.**
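A minimal sketch of the equivalence between a lambda and a named function. The function names and the sample sentence are made up for illustration.

```python
# A named function defined with def.
def word_count_def(text):
    return len(text.split())

# The same logic written as a lambda and bound to a name for comparison.
word_count_lambda = lambda text: len(text.split())

sample = "apple released a new iphone"
print(word_count_def(sample), word_count_lambda(sample))
```

In practice a lambda is usually passed directly to `apply()` without being given a name; binding it to a variable here is only for the side-by-side comparison.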


In [None]:
## Let's get a word count without writing a lambda function

# total words in each tweet created as a new column in the dataframe
Apple_tweets['totalwords'] = [len(x.split()) for x in Apple_tweets['Tweet'].tolist()]


Apple_tweets[['Tweet','totalwords']].head()

In [None]:
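# alternate way of doing the same thing, this time with a lambda
# (split() without an argument is used so that repeated spaces do not inflate the count)
Apple_tweets['word_count'] = Apple_tweets['Tweet'].apply(lambda x: len(str(x).split()))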
Apple_tweets[['Tweet','word_count']].head()
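When counting words, `str.split()` and `str.split(" ")` are not interchangeable: the first collapses runs of whitespace, while the second produces an empty string for every extra space. A short sketch with a made-up string:

```python
# Made-up text with 2 spaces after "apple" and 3 spaces after "released".
text = "apple" + " " * 2 + "released" + " " * 3 + "a new iphone"

count_default = len(text.split())     # splits on any run of whitespace
count_space = len(text.split(" "))    # splits on every single space

print(count_default, count_space)
```

The two counts agree only when words are separated by exactly one space and the text has no leading or trailing spaces.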

### Number of Characters- including spaces

In [None]:
Apple_tweets['char_count'] = Apple_tweets['Tweet'].str.len()

Apple_tweets[['Tweet','char_count']].head()

In [None]:
Apple_tweets['Tweet'][0]

In [None]:
len(Apple_tweets['Tweet'][0])

### Average Word Length

In [None]:
# status = 'balaji'

In [None]:
# for i in status:
#     print(i)

In [None]:
def avg_word(sentence):
    # splitting the input into separate words
    words = sentence.split()
    if not words: # guard against empty tweets to avoid dividing by zero
        return 0
    # average word length = total number of characters in the words / number of words
    return sum(len(word) for word in words) / len(words)

Apple_tweets['avg_word'] = Apple_tweets['Tweet'].apply(avg_word)
Apple_tweets[['Tweet','avg_word']].head()

### Number of stop Words

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

Apple_tweets['stopwords'] = Apple_tweets['Tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
Apple_tweets[['Tweet','stopwords']].head()

In [None]:
# Alternate way - Nos of stop words
nos_stop = [] # empty list to store count of stop words in each tweet
for i in range(len(Apple_tweets)):
    wrds = Apple_tweets['Tweet'][i].split()
    nos_stop.append(len([words for words in wrds if words in stop]))

nos_stop

### Number of hashtags

In [None]:
Apple_tweets['hashtags'] = Apple_tweets['Tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
Apple_tweets[['Tweet','hashtags']].head()

### Number of numerics

In [None]:
Apple_tweets['numerics'] = Apple_tweets['Tweet'].apply(lambda x: len(re.findall(r'[0-9]',x)))
Apple_tweets[['Tweet','numerics']].head()
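The pattern `[0-9]` counts individual digit characters, not whole numbers. If the number of numeric tokens is what you want, `[0-9]+` is one option. A sketch on a made-up tweet:

```python
import re

# Made-up tweet for illustration.
tweet = "iphone 5s costs 199 dollars"

digit_count = len(re.findall(r"[0-9]", tweet))    # counts each digit separately
number_count = len(re.findall(r"[0-9]+", tweet))  # counts each run of digits once

print(digit_count, number_count)
```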

### Number of Uppercase Words

In [None]:
Apple_tweets['upper'] = Apple_tweets['Tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
Apple_tweets[['Tweet','upper']].head()

### Number of Uppercase Letters

In [None]:
Apple_tweets['upper_letter'] = Apple_tweets['Tweet'].apply(lambda x: len(re.findall(r'[A-Z]',x)))
Apple_tweets[['Tweet','upper_letter']].head()

## Basic Pre-Processing

### Lower Case conversion

In [None]:
Apple_tweets['Tweet'] = Apple_tweets['Tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
Apple_tweets['Tweet'].head()

### Removal of all non-alphanumeric and non-space characters

In [None]:
Apple_tweets['Tweet'] = Apple_tweets['Tweet'].str.replace(r'[^\w\s]', '', regex=True) # regex=True is needed in pandas 2.0+, where the default is a literal match
#\w: matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _)
#\s: matches any whitespace character
#[^...]: matches any single character that is NOT listed inside the brackets
Apple_tweets['Tweet'].head()

### Removal of StopWords

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
Apple_tweets['Tweet'] = Apple_tweets['Tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
Apple_tweets['Tweet'].head()

### Common Words Removal
1. **We will list the 10 most frequently occurring words and then decide whether to remove or retain each of them.**
2. **The reason is that every tweet in this file relates to Apple, so there is no point in keeping a word like apple unless we also have tweets about other brands.**

In [None]:
freq = pd.Series(' '.join(Apple_tweets['Tweet']).split()).value_counts()[:10]
freq

1. **Because the tweets cover multiple products, iphone will be kept. Similarly, some tweets refer to old products without using the word old, so new will also be kept.**
2. **Hence only apple and get will be removed.**

In [None]:
freq =['apple','get']

In [None]:
Apple_tweets['Tweet'] = Apple_tweets['Tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
Apple_tweets['Tweet'].head()

### Rare Words Removal
**Rarely occurring words are candidates for removal because their association with the other words may be mere noise.**

In [None]:
freq = pd.Series(' '.join(Apple_tweets['Tweet']).split()).value_counts().tail(10)
freq
## It is difficult to tell whether these words carry any useful association, so to start with they are kept in the dataset

### Stemming

Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach.

In [None]:
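from nltk.stem import PorterStemmer # the Porter stemmer (nltk also offers SnowballStemmer)
st = PorterStemmer()
# The stemmed tweets are only displayed here; assign the result back to Apple_tweets['Tweet'] to keep them
Apple_tweets['Tweet'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))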
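A quick way to get a feel for stemming is to run the stemmer on a couple of words. The words below are chosen for illustration and are not taken from the dataset.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; the stemmer strips the "ing" and "s" suffixes.
words = ["running", "tweets"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Because the rules are purely mechanical, stems are not guaranteed to be dictionary words; lemmatization is the usual alternative when readable base forms matter.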

### Target Variable Conversion

Now, to label the sentiments as positive or negative, convert the Avg column: if the value is >= 0 the tweet is Positive, otherwise it is Negative. This gives us a binary target (dependent) variable.

In [None]:
Apple_tweets["Sentiment"] = Apple_tweets["Avg"].apply(get_senti) # get_senti is the user-defined function we created earlier

In [None]:
Apple_tweets.head()

In [None]:
Apple_tweets.info()

## Distribution of Target Variable : Sentiment

In [None]:
Apple_tweets.Sentiment.value_counts(normalize=True)

In [None]:
processed_features = Apple_tweets['Tweet'].values # X: the pre-processed tweet text (full dataset; it is split into train and test later)
labels = Apple_tweets['Sentiment'].values # y: the sentiment labels

In [None]:
processed_features

In [None]:
labels

## TfidfVectorizer

More here - https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer (max_features=2500, min_df=7, max_df=0.8)
processed_features = vectorizer.fit_transform(processed_features).toarray()

# fit_transform() here returns a document term matrix
# we are converting the output into an array so that we can put it into a dataframe

In [None]:
## Extra Knowledge Bytes (TF-IDF)

# Let's see what our TF-IDF matrix looks like (sorting by the feature named '5s')
# We build a DataFrame with the feature names given by the TF-IDF vectorizer and sort it for understanding.
# Chaining the .head() method on the DataFrame shows the first few observations of the TF-IDF matrix sorted by '5s'
# get_feature_names_out() requires scikit-learn 1.0 or later

pd.DataFrame(processed_features, columns = vectorizer.get_feature_names_out()).sort_values(by = '5s', ascending=False).head()

## Train-Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.3, random_state=0)
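The effect of `test_size=0.3` and `random_state` can be checked on toy data. The arrays below are made up; only the call itself mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and 10 labels.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array(["Positive", "Negative"] * 5)

# 30% of the rows go to the test set; random_state fixes the shuffle.
tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)

# Repeating the call with the same seed reproduces the same split.
tr_X2, te_X2, tr_y2, te_y2 = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)

print(tr_X.shape, te_X.shape)
print(np.array_equal(te_X, te_X2))
```

Unlike the positional slicing used in the first half of the notebook, `train_test_split` shuffles the rows itself, so no separate shuffle is needed. When the classes are imbalanced, passing `stratify=labels` keeps the class proportions similar in both parts.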

In [None]:
y_train[:10] # sample of first 10 values in target variable

# Gaussian Naive Bayes

In [None]:
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# ConfusionMatrixDisplay requires scikit-learn 1.0 or later (plot_confusion_matrix was removed in 1.2)
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

In [None]:
NB_model = GaussianNB()
NB_model.fit(X_train, y_train)

## Train Data predictions

In [None]:
y_train_predict = NB_model.predict(X_train)

In [None]:
## Accuracy
model_score = NB_model.score(X_train, y_train)
model_score

In [None]:
import seaborn as sns
sns.set()

In [None]:
## confusion_matrix
ConfusionMatrixDisplay.from_estimator(NB_model, X_train, y_train, colorbar=False);
plt.grid(False); # switch off the grid lines that sns.set() turns on

In [None]:
## classification_report
print(classification_report(y_train,y_train_predict))  

## Test Data Predictions

In [None]:
## Performance metrics on the test data set
y_test_predict = NB_model.predict(X_test)

In [None]:
model_score = NB_model.score(X_test, y_test) # accuracy
model_score

In [None]:
ConfusionMatrixDisplay.from_estimator(NB_model, X_test, y_test, colorbar=False);

In [None]:
print(classification_report(y_test, y_test_predict))
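The numbers in the confusion matrix and the classification report are linked by simple arithmetic. The sketch below uses five made-up true and predicted labels, not the model's output, to show how accuracy, precision and recall follow from the matrix.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels for illustration only.
toy_true = ["Positive", "Positive", "Negative", "Negative", "Positive"]
toy_pred = ["Positive", "Negative", "Negative", "Positive", "Positive"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(toy_true, toy_pred, labels=["Negative", "Positive"])
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(toy_true, toy_pred)                           # (tp + tn) / total
precision = precision_score(toy_true, toy_pred, pos_label="Positive")  # tp / (tp + fp)
recall = recall_score(toy_true, toy_pred, pos_label="Positive")        # tp / (tp + fn)

print(cm)
print(accuracy, precision, recall)
```

Comparing these metrics between the train and test sets is a simple check for overfitting: a large gap suggests the model has memorized the training tweets.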

**Please note: model building is an iterative process. Model performance on both the train and test datasets can be improved using feature engineering, feature extraction and hyperparameter tuning (including combinations of various parameters).** 

**The model has to match the business objective, so various permutations and combinations can be tried to refine it.**

# Creating a Wordcloud

## Clean up a bit more !

In [None]:
# Removing symbols and punctuations 
# further_clean = Apple_tweets['Tweet'].str.replace('[^\w\s]','')

# Extending the list of stop words (including words like Apple, bitly, dear, please, etc.)
stop_words = list(stopwords.words('english'))
stop_words.extend(["apple", "http","bit","bitly","bit ly", "dear", "im", "i'm", "please"])

In [None]:
#Removing stop words (extended list as above) from the corpus 

corpus = Apple_tweets['Tweet'].apply(lambda x: ' '.join([z for z in x.split() if z not in stop_words])) 

corpus

In [None]:
wc_a = ' '.join(corpus)

In [None]:
# Word Cloud 
from wordcloud import WordCloud
# wordcloud = WordCloud().generate(wc_a) if ok with default wordcloud parameters

wordcloud = WordCloud(width = 3000, height = 3000, 
                background_color ='black', 
                min_font_size = 10, random_state=100).generate(wc_a) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off")
plt.title('Word Cloud') # a title stays visible, whereas an x-axis label is hidden by axis("off")
plt.tight_layout(pad = 0) 
plt.show()

print("Word Cloud for Apple_Tweets (after cleaning)!!")


#Tip: You can pass stopwords and a regex (for punctuation/symbols) to WordCloud itself through its stopwords and regexp parameters; press Shift+Tab on WordCloud in Jupyter to see the full list of parameters!

# END