# PART ONE

# QUESTION:

• **DOMAIN**: Digital content management

• **CONTEXT**: Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing anything about them. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

• **DATA DESCRIPTION**: Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of
19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or
approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and
the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is
marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

• 8240 "10s" blogs (ages 13-17),    
• 8086 "20s" blogs (ages 23-27), and    
• 2994 "30s" blogs (ages 33-47)


• For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of
common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post, and links within a post are denoted by the label urllink.

• **PROJECT OBJECTIVE**: To build an NLP classifier which can use input text parameters to determine the label/s of the blog. Specific to this case
study, you can consider the text of the blog: ‘text’ feature as independent variable and ‘topic’ as dependent variable.

Steps and tasks: [ Total Score: 40 Marks]

1. Read and Analyse Dataset. [5 Marks]

    A. Clearly write outcome of data analysis(Minimum 2 points) [2 Marks].  
    B. Clean the Structured Data [3 Marks].  
        i. Missing value analysis and imputation. [1 Marks]
        ii. Eliminate Non-English textual data. [2 Marks]
             Hint: Refer to the ‘langdetect’ library to detect the language of the input text.

2. Preprocess unstructured data to make it consumable for model training. [5 Marks]

    A. Eliminate All special Characters and Numbers [2 Marks].  
    B. Lowercase all textual data [1 Marks].  
    C. Remove all Stopwords [1 Marks].   
    D. Remove all extra white spaces [1 Marks].  

3. Build a base Classification model [8 Marks]

    A. Create dependent and independent variables [2 Marks].  
        Hint: Treat ‘topic’ as a Target variable.
    B. Split data into train and test. [1 Marks].  
    C. Vectorize data using any one vectorizer. [2 Marks].   
    D. Build a base model for Supervised Learning - Classification. [2 Marks].  
    E. Clearly print Performance Metrics. [1 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC

4. Improve Performance of model. [14 Marks].  

    A. Experiment with other vectorisers. [4 Marks].  
    B. Build classifier Models using other algorithms than base model. [4 Marks].  
    C. Tune Parameters/Hyperparameters of the model/s. [4 Marks].  
    D. Clearly print Performance Metrics. [2 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC.  

5. Share insights on relative performance comparison [8 Marks].  

    A. Which vectorizer performed better? Probable reason? [2 Marks].   
    B. Which model outperformed? Probable reason? [2 Marks].   
    C. Which parameter/hyperparameter significantly helped to improve performance? Probable reason? [2 Marks].    
    D. According to you, which performance metric should be given most importance, and why? [2 Marks]. 

**Mapping the drive**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Importing the libraries**

In [1]:
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__

UsageError: Line magic function `%tensorflow_version` not found.


In [2]:
!pip install langdetect



In [3]:
import os
import pandas as pd
from langdetect import detect
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score


### 1. Read and Analyse Dataset. [5 Marks]

    A. Clearly write outcome of data analysis(Minimum 2 points) [2 Marks].  
    B. Clean the Structured Data [3 Marks].  
        i. Missing value analysis and imputation. [1 Marks]
        ii. Eliminate Non-English textual data. [2 Marks]
             Hint: Refer to the ‘langdetect’ library to detect the language of the input text.

**Set project directory**

**Unzipping the files and extracting the csv**


In [4]:
# project_path = "/content/drive/My Drive/aiml/nlp/project1/"

# os.chdir(project_path)

from zipfile import ZipFile

with ZipFile('blogs.zip', 'r') as zipdata:
    data_csv = zipdata.open('blogtext.csv')

**Read the csv files**

In [5]:
df = pd.read_csv(data_csv)
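
The cells above open the archive member inside the `with` block and read it afterwards. A minimal sketch of reading the member while the archive is still open is shown below. It uses a tiny in-memory archive as a stand-in for `blogs.zip`; the single row and the names `buffer`, `csv_text` and `sample_df` are assumptions made for illustration.

```python
import io
from zipfile import ZipFile

import pandas as pd

# One illustrative row shaped like blogtext.csv (not the real file contents).
csv_text = (
    "id,gender,age,topic,sign,date,text\n"
    '2059027,male,15,Student,Leo,"14,May,2004",testing testing\n'
)

# Build a tiny in-memory archive that stands in for blogs.zip.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("blogtext.csv", csv_text)

# Read the CSV member while the archive is still open.
buffer.seek(0)
with ZipFile(buffer, "r") as archive:
    with archive.open("blogtext.csv") as member:
        sample_df = pd.read_csv(member)

print(sample_df.shape)
```

Keeping the read inside the `with` block means the code does not rely on the member handle staying usable after the archive object is closed.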

In [None]:
del data_csv

**Check the column names**

In [None]:
df.columns

**The dataset has 7 columns in total: 'id', 'gender', 'age', 'topic', 'sign', 'date', 'text'**

**Checking the data( First 5 rows)**

In [8]:
df.head(5)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


**Checking the shape and info of the data**

In [9]:
df.shape

(681284, 7)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


**Check if there is null data present on any columns**

In [11]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64



*   There are 7 columns and 681284 rows of data.
*   'id' and 'date' add little value for this task, so they can be dropped.
*   Except for 'id' and 'age' (both int64), all columns have dtype object.
*   There are no null values in any column.
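
The points above could be extended with class distributions and post lengths. The following is a minimal sketch on a toy frame, not on the real data; the rows and the names `toy` and `word_counts` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows shaped like the blog data; the values are illustrative only.
toy = pd.DataFrame({
    "gender": ["male", "male", "female", "female"],
    "age": [15, 15, 33, 24],
    "topic": ["Student", "Student", "InvestmentBanking", "Arts"],
    "sign": ["Leo", "Leo", "Aquarius", "Libra"],
    "text": ["testing testing", "info found", "thanks yahoo toolbar", "coca cola"],
})

# Class distribution of each candidate label column.
for column in ["gender", "age", "topic", "sign"]:
    print(toy[column].value_counts(), "\n")

# Number of words per post, useful for spotting empty or very short posts.
word_counts = toy["text"].str.split().str.len()
print(word_counts.describe())
```

On the full frame, the same calls applied to `df` would show how imbalanced each label is, which matters later when choosing a performance metric.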




In [12]:
# df1 =df.copy()
# df1["isalpha"]=df['text'].apply(lambda x: bool(re.search('[a-zA-Z]', x)))
# df1["isalpha"].count()

681284

**Eliminate Non-English textual data.**

In [13]:
# Commented out because an earlier run found no non-English text and the detection takes a long time to execute

# def det(x):
#     try:
#         lang = detect(x)
#     except:
#         lang = 'Other'
#     return lang
# df['detect'] = df['text'].apply(det)


In [14]:
# df = df[df['detect'] == 'en']
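
Because the detection step is slow and can raise on text without usable features, a guarded wrapper may help. The sketch below is one possible shape, not the notebook's method: `safe_detect`, `max_chars` and `toy_detector` are hypothetical names, and `toy_detector` is only a stand-in so that the example runs without `langdetect` installed.

```python
def safe_detect(text, detector, max_chars=500):
    """Return the detector's language code for `text`, or 'unknown' on failure."""
    snippet = text.strip()[:max_chars]  # only a prefix is checked, to limit run time
    if not snippet:
        return "unknown"
    try:
        return detector(snippet)
    except Exception:  # e.g. langdetect raises when the text has no usable features
        return "unknown"


# Hypothetical stand-in detector; in the notebook this role is played by langdetect.detect.
def toy_detector(snippet):
    if not any(ch.isalpha() for ch in snippet):
        raise ValueError("no features in text")
    return "nl" if " het " in f" {snippet.lower()} " else "en"


posts = ["In het kader van kernfusie op aarde", "testing!!! testing!!!", "   ", "12345"]
languages = [safe_detect(post, toy_detector) for post in posts]
print(languages)
```

With the real library, the call would look like `df['detect'] = df['text'].apply(lambda t: safe_detect(t, detect))`. Setting `DetectorFactory.seed` from `langdetect` beforehand should make the results reproducible between runs.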

****

**We have completed the first part. We read the data, analysed its attributes and their types, and noted down the outcomes.**

**We checked for null data, but none is present. The langdetect-based check for non-English text is included above but commented out, since an earlier run found no non-English text and the check takes a long time to execute.**

****

## Let's move to the second question

2. Preprocess unstructured data to make it consumable for model training. [5 Marks]

    A. Eliminate All special Characters and Numbers [2 Marks].  
    B. Lowercase all textual data [1 Marks].  
    C. Remove all Stopwords [1 Marks].   
    D. Remove all extra white spaces [1 Marks]. 

**A. Eliminate All special Characters and Numbers**

In [15]:
df.text.head(5)

0               Info has been found (+/- 100 pages,...
1               These are the team members:   Drewe...
2               In het kader van kernfusie op aarde...
3                     testing!!!  testing!!!          
4                 Thanks to Yahoo!'s Toolbar I can ...
Name: text, dtype: object

In [16]:
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

In [17]:
df.text.head(5)

0     Info has been found pages and MB of pdf files...
1     These are the team members Drewes van der Laa...
2     In het kader van kernfusie op aarde MAAK JE E...
3                                     testing testing 
4     Thanks to Yahoo s Toolbar I can now capture t...
Name: text, dtype: object

**All digits and special characters have been removed, as a comparison with the text printed before this step shows**

**B. Now let's lowercase the data**

In [18]:
df.text = df.text.apply(lambda x: x.lower())
df.text.head(5)

0     info has been found pages and mb of pdf files...
1     these are the team members drewes van der laa...
2     in het kader van kernfusie op aarde maak je e...
3                                     testing testing 
4     thanks to yahoo s toolbar i can now capture t...
Name: text, dtype: object

**All the text is now lowercase, for example "info" and "these"**

**C. Remove all Stopwords**

In [19]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # separate name, so the imported `stopwords` module is not shadowed
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/santoshsingh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
df.text.head(5)

0    info found pages mb pdf files wait untill team...
1    team members drewes van der laag urllink mail ...
2    het kader van kernfusie op aarde maak je eigen...
3                                      testing testing
4    thanks yahoo toolbar capture urls popups means...
Name: text, dtype: object

**As seen above, stopwords such as "has", "been", "and" and "in" have been removed**

**D. Remove all extra white spaces**

In [21]:
df.text = df.text.apply(lambda x: ' '.join(x.split()))  # collapses internal runs of whitespace and trims both ends

In [22]:
df.text[6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'

****

**We have now completed all the preprocessing steps: eliminating special characters and numbers, lowercasing the text, removing stopwords, and removing extra white spaces**
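
The four steps can also be bundled into one function, which makes it easier to apply the same cleaning to new text at prediction time. This is a minimal sketch: `preprocess` and `TOY_STOPWORDS` are hypothetical names, and the small stop-word set is an assumption that stands in for NLTK's English list.

```python
import re

# A tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"has", "been", "and", "of", "to", "i", "can", "s", "the"}


def preprocess(text, stop_words=TOY_STOPWORDS):
    """Apply steps A-D: strip non-letters, lowercase, drop stop words, collapse spaces."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    lowered = letters_only.lower()
    kept = [word for word in lowered.split() if word not in stop_words]
    return " ".join(kept)  # split/join also removes any extra whitespace


cleaned = preprocess("  Thanks to Yahoo!'s Toolbar I can now capture 100 URLs  ")
print(cleaned)
```

On the full frame this would be applied as `df.text.apply(preprocess)`, with the NLTK stop-word set passed as `stop_words`.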

****

## Let's move to question 3

3. Build a base Classification model [8 Marks]

    A. Create dependent and independent variables [2 Marks].  
        Hint: Treat ‘topic’ as a Target variable.
    B. Split data into train and test. [1 Marks].  
    C. Vectorize data using any one vectorizer. [2 Marks].   
    D. Build a base model for Supervised Learning - Classification. [2 Marks].  
    E. Clearly print Performance Metrics. [1 Marks].  
        Hint: Accuracy, Precision, Recall, ROC-AUC

**A. Create dependent and independent variables**

**The independent variable is the 'text' column**

**Merge all the label columns together, so that each post carries all of its tags in a single list**

In [23]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)

**Let's remove the other columns and keep only text and labels**

In [24]:
df = df[['text','labels']]

In [25]:
df.head()

| | text | labels |
|---|---|---|
| 0 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | testing testing | [male, 15, Student, Leo] |
| 4 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


**Here 'labels' is the dependent variable and 'text' is the independent variable**
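
A row-wise `apply` over several hundred thousand rows can be slow. The sketch below shows one column-wise way to build the same label lists, which is typically faster. The toy rows come from the `head()` output earlier, and the names `toy` and `label_columns` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows taken from the head() output shown earlier.
toy = pd.DataFrame({
    "gender": ["male", "male"],
    "age": [15, 33],
    "topic": ["Student", "InvestmentBanking"],
    "sign": ["Leo", "Aquarius"],
})

label_columns = ["gender", "age", "topic", "sign"]
# Cast every label to str so that ages sit alongside the text labels,
# then build one list of labels per row.
toy["labels"] = toy[label_columns].astype(str).values.tolist()

print(toy["labels"].tolist())
```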

**B. Split data into train and test.**

In [26]:
X_train, X_test, y_train, y_test = train_test_split(df.text.values, df.labels.values, test_size=0.20, random_state=42)

**We have split the data into X_train, X_test, y_train, y_test**

**C. Vectorize data using any one vectorizer.**

**Using CountVectorizer**

In [27]:
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

**Let's look at some feature names**

In [1]:
vectorizer.get_feature_names_out()[:5]  # get_feature_names() was removed in scikit-learn 1.2

NameError: name 'vectorizer' is not defined

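**Let's view the document-term matrix**

In [None]:
# Densifying the full sparse matrix would need far too much memory, so view only a small slice
X_train_bow[:5, :10].toarray()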

**D. Build a base model for Supervised Learning - Classification.**

**Let's create a dictionary to get label counts**

In [None]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

In [None]:
label_counts

**Let's load a multilabel binarizer and fit it on the labels.**

In [None]:
mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
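
As a small illustration of what the binarizer produces, the sketch below fits it on two label lists taken from the `head()` output. The names `toy_labels`, `toy_mlb` and `toy_y` are assumptions for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

# One column per distinct label, in sorted order.
print(list(toy_mlb.classes_))
# One row per post, with 1 where the post carries that label.
print(toy_y)
# inverse_transform maps indicator rows back to tuples of labels.
print(toy_mlb.inverse_transform(toy_y))
```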

**Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label.**

In [None]:
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

In [None]:
clf.fit(X_train_bow, y_train)
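
`OneVsRestClassifier` fits one binary classifier per label column. The following is a minimal sketch on toy data; `X_toy`, `y_toy` and `toy_clf` are hypothetical names, and `max_iter=1000` is an assumption added here rather than a setting taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy feature matrix (4 samples, 2 features) and a 2-label indicator target.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# max_iter is raised as an assumption, to give lbfgs more room to converge.
toy_clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
toy_clf.fit(X_toy, y_toy)

print(len(toy_clf.estimators_))                # one fitted estimator per label
print(toy_clf.predict(X_toy).shape)            # hard 0/1 predictions per label
print(toy_clf.decision_function(X_toy).shape)  # one score per sample and label
```

Since each label gets its own model, training time on the full data grows with the number of distinct labels.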

**E. Clearly print Performance Metrics.**

In [None]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)

**Get inverse transform for predicted labels and test labels**

In [None]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

**Print some samples**

In [None]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Calculate the evaluation metrics:

*   Accuracy
*   F1-score
*   Precision
*   Recall


In [None]:

from sklearn.metrics import precision_score  # plain precision, as listed in the task hint

def print_evaluation_scores(y_val, predicted):
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    print('Precision score: ', precision_score(y_val, predicted, average='micro'))
    print('Recall score: ', recall_score(y_val, predicted, average='micro'))

In [None]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)
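
The task hint also lists ROC-AUC, which needs continuous scores rather than hard labels. The sketch below shows how the four hinted metrics behave on toy multilabel arrays. All values are illustrative, and `y_score` is a hypothetical stand-in for the output of `clf.decision_function(X_test_bow)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Toy multilabel indicator arrays: 3 samples, 3 labels (illustrative values only).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])
# Hypothetical decision scores, in the role of clf.decision_function(X_test_bow).
y_score = np.array([[2.0, -1.0, -0.5], [-1.5, 1.0, -2.0], [1.2, 0.8, 0.3]])

# For multilabel input, accuracy_score is the exact-match (subset) accuracy:
# a sample counts only if every one of its labels is predicted correctly.
subset_accuracy = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average="micro")
micro_recall = recall_score(y_true, y_pred, average="micro")
# ROC-AUC is computed from the scores, pooled over all labels.
micro_roc_auc = roc_auc_score(y_true, y_score, average="micro")

print(subset_accuracy, micro_precision, micro_recall, micro_roc_auc)
```

Because subset accuracy requires all four labels of a post to be right at once, it is usually much lower than the micro-averaged scores and should be read alongside them.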