In [23]:
import pandas as pd
import numpy as np
import re
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk import word_tokenize, corpus, tokenize
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
warnings.filterwarnings("ignore")

In [2]:
#Loading the data for mini-challenges
#Data contains text messages, we would use it to build a News classifier.
df = pd.read_csv('../data/NewsMl.csv')
df.head()

Unnamed: 0,NewsML,Labels
0,The Internet may be overflowing with new techn...,AaronPressman
1,The U.S. Postal Service announced Wednesday a ...,AaronPressman
2,Elementary school students with access to the ...,AaronPressman
3,An influential Internet organisation has backe...,AaronPressman
4,An influential Internet organisation has backe...,AaronPressman


In [3]:
#Feature vectors
X = df['NewsML']
print(X[0])
print("=="*30)
#target variable
y = df['Labels'] 
print(y[0])

The Internet may be overflowing with new technology but crime in cyberspace is still of the old-fashioned variety. The National Consumers League said Wednesday that the most popular scam on the Internet was the pyramid scheme, in which early investors in a bogus fund are paid off with deposits of later investors. The league, a non-profit consumer advocacy group, tracks web scams through a site it set up on the world wide web in February called Internet Fraud Watch at http://www.fraud.org. The site, which collects reports directly from consumers, has been widely praised by law enforcement agencies. "Consumers who suspect a scam on the Internet have critical information," said Jodie Bernstein, director of the Federal Trade Commission's Bureau of Consumer Protection. Internet Fraud Watch "has been a major help to the FTC in identifying particular scams in their infancy." In May, for example, the commission used Internet reports to shut down a site run by Fortuna Alliance that had taken in

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 1
***
### Instructions
- Preprocess text data by doing the following steps on the `NewsML` column
        - retain only alphabets
        - Convert the data to lowercase and tokezize
        - Remove stop words by using list comprehension
        - join list elements
- Finally split into train and test using train_test_split function where feature is `X`, target is `y`,test size is 20% and random state is 3. Save the resultant variables as X_train, X_test, y_train and y_test

In [4]:
X = X.apply(lambda x : re.sub('[^a-zA-Z]',' ', x))

In [5]:
X = X.apply(lambda x : x.lower())

In [6]:
for i in range(X.shape[0]):
    X[i] = tokenize.word_tokenize(X[i])

In [7]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Error loading stopwords: HTTP Error 503: Backend is
[nltk_data]     unhealthy


In [8]:
stop = set(stopwords.words('english'))

In [9]:
X = X.apply(lambda x : [i for i in x if i not in stop])

In [10]:
X = X.apply(lambda x : ' '.join(x))

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 2
***
**Vectorize with Bag-of-words and TF-IDF approach**: <br>
After cleaning data its time to vectorize data so that it can be fed into an ML algorithm. You will be doing it with two approaches: Bag-of-words and TF-IDF.

### Instructions

- Initialize Bag-of-words vectorizer using CountVectorizer() and TF-IDF vectorizer using TfidfVectorizer(ngram_range=(1,3)).     Save them as count_vectorizer and tfidf_vectorizer respectively.

- Next thing to do is fit each vectorizer on training and test features with text data and transform them to vectors.

- First fit and transform data with count_vectorizer on X_train using .fit_transform(X_train) method of count_vectorizer and     save it as `X_train_count`

- Use this fitted version of count_vectorizer on X_test and transform X_test with .transform(X_test) method of                   count_vectorizer. Save it as `X_test_count`

- Similarly repeat the previous two steps with tfidf_vectorizer and save the transformed training feature as `X_train_tfidf`     and transformed test feature as `X_test_tfidf`

In [12]:
count = CountVectorizer()
tf_idf = TfidfVectorizer(range(1, 3))

X_train_count = count.fit_transform(X_train)
X_test_count = count.transform(X_test)

X_train_tf_idf = tf_idf.fit_transform(X_train)
X_test_tf_idf = tf_idf.transform(X_test)

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 3
***
**Predicting with Multinomial Naive Bayes**:
Multinomial Naive Bayes is an algorithm that can be used for the purpose of multi-class classification. You will be using it to train and test it on both the versions i.e. Bag-of-words and TF-IDF ones and then checking the accuracy on both of them.

### Instructions
- Initialize two Multinomial Naive Bayes classifiers with MultinomialNB() and save them as `nb_1` and `nb_2`. The reason for initializing two classifiers is because you will be training and testing on both Bag-of-words and TF-IDF transformed training data.

- Fit nb1 on X_train_count and y_train using `.fit()` method

- Fit nb2 on X_train_tfidf and y_train using `.fit()` method

- Find the accuracy with Bag-of-words approach using accuracy_score(nb_1.predict(X_test_count), y_test) and save it as           `acc_count_nb 

- Similarly find the accuracy for the TF-IDF approach (only difference is the classifer is nb_2) and save it as `acc_tfidf_nb`

- Print out `acc_count_nb` and `acc_tfidf_nb` to check which version performs better for with Multinomial Naive Bayes as         classifer

In [13]:
nb_1, nb_2 = MultinomialNB(), MultinomialNB()

In [17]:
nb_1.fit(X_train_count, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [18]:
nb_2.fit(X_train_tf_idf, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [21]:
accuracy_score(nb_1.predict(X_test_count), y_test)

0.92

In [22]:
accuracy_score(nb_2.predict(X_test_tf_idf), y_test)

0.94

<img src="../images/icon/ppt-icons.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Mini-Challenge - 4
***
**Predicting with Logistic Regression**
Logistic Regression can be used for binary classification but when combined with OneVsRest classifer, it can perform multiclass classification as well. You will be using one such algorithm to train and test it on both the versions i.e. Bag-of-words and TF-IDF ones and then checking the accuracy on both of them

### Instructions

- First initialize two classifiers with OneVsRestClassifier(LogisticRegression(random_state=10)) and save them as `logreg_1`     and `logreg_2`. The reason for initializing two classifiers is because you will be training and testing on both Bag-of-words   and TF-IDF transformed training data.

- Fit logreg_1 on X_train_count and Y_train using .fit() method

- Fit logreg_2 on X_train_tfidf and Y_train using .fit() method

- Find the accuracy with Bag-of-words approach using accuracy_score(logreg_1.predict(X_test_count), Y_test) and save it as       `acc_count_logreg`

- Similarly find the accuracy for the TF-IDF approach (only difference is the classifer is logreg_2) and save it as               `acc_tfidf_logreg`

- Print out `acc_count_logreg` and `acc_tfidf_logreg` to check which version performs better for with Multinomial Naive Bayes     as classifer

In [25]:
log_reg_1, log_reg_2 = OneVsRestClassifier(LogisticRegression(random_state=10)), OneVsRestClassifier(LogisticRegression(random_state=10))

In [26]:
log_reg_1.fit(X_train_count, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=10, solver='warn',
                                                 tol=0.0001, verbose=0,
                                                 warm_start=False),
                    n_jobs=None)

In [27]:
log_reg_2.fit(X_train_tf_idf, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=10, solver='warn',
                                                 tol=0.0001, verbose=0,
                                                 warm_start=False),
                    n_jobs=None)

In [28]:
accuracy_score(log_reg_1.predict(X_test_count), y_test)

0.96

In [29]:
accuracy_score(log_reg_2.predict(X_test_tf_idf), y_test)

0.94

<img src="../images/icon/quiz.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

##  Foundation of Text Analytics
***
Q1. N-grams are defined as the combination of N keywords together. How many bi-grams can be generated from given sentence:
“Greyatom is a great source to learn data science”
```python
A) 7
B) 8
C) 9
D) 10
E) 11


```
Q2. Which of the following models can perform tweet classification with regards to context mentioned above?
 
 Suppose You have collected a data of about 10,000 rows of tweet text and no other information. You want to create a tweet      classification model that categorizes each of the tweets in three buckets – positive, negative and neutral.
```python
A) Naive Bayes
B) SVM
C) None of the above


```