
# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



## Learning Objective


At the end of the experiment, you will be able to:

*  Pre-process the data
*  Represent text documents using Bag of Words and Word2Vec

In [None]:
#@title Experiment Walkthrough Video
#@markdown BoW vs W2V
from IPython.display import HTML

HTML("""<video width="850" height="480" controls>
  <source src="https://cdn.exec.talentsprint.com/non-processed/Bag_of_Words_Vs_Word2Vec.mp4">
</video>
""")

## Dataset

This dataset contains around 200k news headlines from 2012 to 2018, obtained from [HuffPost](https://www.huffpost.com/). A model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

Each news headline has a corresponding category. The categories and their article counts are as follows:


    POLITICS: 32739
    WELLNESS: 17827
    ENTERTAINMENT: 16058
    TRAVEL: 9887
    STYLE & BEAUTY: 9649
    PARENTING: 8677
    HEALTHY LIVING: 6694
    QUEER VOICES: 6314
    FOOD & DRINK: 6226
    BUSINESS: 5937
    COMEDY: 5175
    SPORTS: 4884
    BLACK VOICES: 4528
    HOME & LIVING: 4195
    PARENTS: 3955
    THE WORLDPOST: 3664
    WEDDINGS: 3651
    WOMEN: 3490
    IMPACT: 3459
    DIVORCE: 3426
    CRIME: 3405
    MEDIA: 2815
    WEIRD NEWS: 2670
    GREEN: 2622
    WORLDPOST: 2579
    RELIGION: 2556
    STYLE: 2254
    SCIENCE: 2178
    WORLD NEWS: 2177
    TASTE: 2096
    TECH: 2082
    MONEY: 1707
    ARTS: 1509
    FIFTY: 1401
    GOOD NEWS: 1398
    ARTS & CULTURE: 1339
    ENVIRONMENT: 1323
    COLLEGE: 1144
    LATINO VOICES: 1129
    CULTURE & ARTS: 1030
    EDUCATION: 1004


#### Description
This dataset has the following columns:
1. **Category:** Category the article belongs to
2. **Headline:** Headline of the article
3. **Authors:** Person(s) who authored the article
4. **Link:** Link to the post
5. **Short_description:** Short description of the article
6. **Date:** Date the article was published

Out of the 41 categories in the News_Category_Dataset, we consider four (Travel, Tech, Science, College) for this experiment.

### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython
import re
ipython = get_ipython()

notebook= "U2W6_18_News_Category_Dataset_BoW_vs_W2V_A" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/News_Category_Dataset_v2.csv")
    ipython.magic("sx wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar")
    ipython.magic("sx unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar")
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getWalkthrough() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook, "feedback_walkthrough":Walkthrough ,
              "feedback_experiments_input" : Comments,
              "feedback_inclass_mentor": Mentor_support}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iiith.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


def getWalkthrough():
  try:
    if not Walkthrough:
      raise NameError
    else:
      return Walkthrough
  except NameError:
    print ("Please answer Walkthrough Question")
    return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



## Import packages


In [None]:
import re
import nltk
import pandas as pd
import numpy as np
import gensim
from nltk.corpus import stopwords
nltk.download('stopwords')
import warnings
warnings.filterwarnings('ignore')

## Load the data


In [None]:
# Load the data
df = pd.read_csv('News_Category_Dataset_v2.csv')
df.head()

In [None]:
# YOUR CODE HERE: Count the classes in category


## Data Pre-processing

We consider four categories (Travel, Tech, Science, College) for this experiment.

Hint: To select the rows that belong to these categories, refer to the pandas [`DataFrame.isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) documentation.
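
As a rough illustration of the hint, the sketch below filters a small hypothetical DataFrame (`toy_df`, invented for this example and not part of the dataset) with `isin`. It only shows the shape of the operation under that assumption.

```python
import pandas as pd

# Toy stand-in for the news dataset (hypothetical rows, not real data)
toy_df = pd.DataFrame({
    "category": ["TRAVEL", "POLITICS", "TECH", "SPORTS", "SCIENCE", "COLLEGE"],
    "headline": ["h1", "h2", "h3", "h4", "h5", "h6"],
})

selected = ["TRAVEL", "TECH", "SCIENCE", "COLLEGE"]

# isin() returns a boolean mask; indexing with it keeps only the matching rows
subset = toy_df[toy_df["category"].isin(selected)].reset_index(drop=True)
print(subset["category"].tolist())
```

Resetting the index is optional; it simply keeps row positions contiguous after the filter.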

In [None]:
# Create a list of the manually selected categories
category = ['TRAVEL','TECH','SCIENCE','COLLEGE']


df =   # YOUR CODE HERE : Keep only the rows whose category is in the list


In [None]:
# Combine the headline and short description into a single text column
df['text'] = df['headline'] +','+ df['short_description']
df['label'] = df['category']

Drop the unwanted columns

In [None]:
# YOUR CODE HERE : drop the columns which are unwanted

Consider the text column as the feature and the label column as the target variable. Convert the labels into numerical values.

Hint: Use a Label Encoder to obtain a numeric representation; refer to the [link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
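
A minimal sketch of label encoding, assuming a short hypothetical list of labels rather than the real `label` column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'label' column
labels = ["TRAVEL", "TECH", "SCIENCE", "COLLEGE", "TECH"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's position is its code
print(list(encoder.classes_))   # ['COLLEGE', 'SCIENCE', 'TECH', 'TRAVEL']
print(encoded.tolist())         # [3, 2, 1, 0, 2]

# inverse_transform maps the integer codes back to the original strings
decoded = encoder.inverse_transform(encoded)
```

Because the codes follow alphabetical order of the class names, they carry no meaning beyond identifying the class.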

In [None]:
from sklearn import preprocessing
# YOUR CODE HERE: Convert the labels into numerical values

In [None]:
df['text'].shape, df['label'].shape

## BoW

### TF-IDF
TF-IDF weighs the number of times a given word appears in a document (a news headline with its short description, in our case) against the number of documents in the corpus that contain the word. Words that appear in many documents receive values closer to zero, while words that appear in fewer documents receive values closer to 1.
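
The effect can be seen on a tiny hypothetical corpus (three invented strings, not taken from the dataset). This is only a sketch using scikit-learn's default `TfidfVectorizer` settings; with its default smoothing, a word that occurs in every document gets a small weight rather than exactly zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "new" occurs in every document
corpus = [
    "new phone launch",
    "new travel deals",
    "new science study",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary

vocab = vectorizer.vocabulary_             # maps each word to its column index
row0 = tfidf[0].toarray().ravel()          # weights for the first document

# The common word "new" is weighted lower than the rarer word "phone"
print(row0[vocab["new"]] < row0[vocab["phone"]])   # True
```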



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# YOUR CODE HERE: Create a TfidfVectorizer object.
# YOUR CODE HERE: Fit the TfidfVectorizer on the text column and transform it.

### Split the data into train and test sets

Hint: Refer to [Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
# YOUR CODE HERE : Split the data into 80% train and 20% test
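
A minimal sketch of an 80/20 split on hypothetical toy data. The `random_state` and `stratify` arguments are choices made for this example, not requirements of the experiment.

```python
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: 10 samples, two balanced classes
X = list(range(10))
y = [0, 1] * 5

# test_size=0.2 holds out 20% of the samples; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))   # 8 2
```

Fixing `random_state` makes the split reproducible across runs.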

### Apply the Classification
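
The experiment does not prescribe a particular classifier. The sketch below assumes logistic regression and uses a few invented sentences in place of the news text, purely to show the fit, predict and score steps. Unlike the notebook flow, it fits the vectorizer on the training texts only, which is a choice of this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical texts: label 0 = travel-like, label 1 = tech-like
train_texts = ["cheap flights to paris", "beach hotel deals",
               "new smartphone chip", "laptop software update"]
train_labels = [0, 0, 1, 1]
test_texts = ["hotel deals in paris", "smartphone software update"]
test_labels = [0, 1]

# Learn the vocabulary on the training texts, then reuse it for the test texts
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Create the classifier, fit it, and score its predictions
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
predictions = clf.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
print(accuracy)
```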


In [None]:
# YOUR CODE HERE: To create an instance and fit the model
# YOUR CODE HERE: To compute the accuracy

## Word2Vec

### Load pre-trained Word2Vec

Let's now load the pretrained vectors. The `limit=500000` argument restricts loading to the first 500,000 word vectors in the file.

In [None]:
W2Vmodel = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

### Word2Vec representation

Convert each document into the average of the Word2Vec vectors of all valid words in the document.

Note: The code cell below takes some time to run.
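
To make the averaging concrete, here is a small sketch that uses a hypothetical 3-dimensional embedding table (`toy_embeddings`) and a hypothetical stop-word set in place of the pretrained model. The helper `average_vector` is invented for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the pretrained model
toy_embeddings = {
    "space":  np.array([1.0, 0.0, 2.0]),
    "rocket": np.array([3.0, 2.0, 0.0]),
}
toy_stop_words = {"the", "a", "to"}

def average_vector(doc, embeddings, stop_words, dim=3):
    """Average the vectors of words that are not stop words and have an embedding."""
    vecs = [embeddings[w] for w in doc.lower().split()
            if w not in stop_words and w in embeddings]
    # A document with no valid words has no average; NaN marks it as such
    return np.mean(vecs, axis=0) if vecs else np.full(dim, np.nan)

print(average_vector("The rocket to space", toy_embeddings, toy_stop_words))  # [2. 1. 1.]
```

Every document maps to a vector of the same length as the embeddings, regardless of how many words it contains.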

In [None]:
# Use a set of English stop words for fast membership tests
stop_words = set(nltk.corpus.stopwords.words('english'))
text = df['text'].astype(str)

doc_vectors = []
# Lowercase each document and strip everything except letters and spaces
for doc in text.str.lower().str.replace('[^a-z ]', '', regex=True):
    # Keep the vectors of words that are not stop words and exist in the embeddings
    word_vecs = [W2Vmodel[word] for word in doc.split()
                 if word not in stop_words and word in W2Vmodel]
    # Average the word vectors; a document with no valid words becomes a row of NaN
    if word_vecs:
        doc_vectors.append(np.mean(word_vecs, axis=0))
    else:
        doc_vectors.append(np.full(W2Vmodel.vector_size, np.nan))

# One row per document, one column per embedding dimension
docs_vectors = pd.DataFrame(doc_vectors)
docs_vectors.shape



### Split the data into train and test sets

Hint: Refer to [Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
# YOUR CODE HERE : Split the data into 80% train and 20% test
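
A document with no valid words ends up as a row of NaN after averaging, and many classifiers reject NaN inputs. One possible way to handle this, sketched on hypothetical stand-ins (`toy_vectors`, `toy_labels`) for the document vectors and encoded labels, is to drop those rows together with their labels before splitting:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the document vectors and the encoded labels
toy_vectors = pd.DataFrame(np.arange(30, dtype=float).reshape(10, 3))
toy_vectors.iloc[4] = np.nan          # a document with no valid words
toy_labels = pd.Series([0, 1] * 5)

# Keep only rows whose vector is fully defined, and the matching labels
mask = toy_vectors.notna().all(axis=1)
X, y = toy_vectors[mask], toy_labels[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Filtering features and labels with the same mask keeps them aligned row by row.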

### Apply the Classification


In [None]:
# YOUR CODE HERE: To create an instance and fit the model
# YOUR CODE HERE: To compute the accuracy

### Please answer the questions below to complete the experiment:




In [None]:
#@title What is the difference between word2vec and BOW? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["","Word2vec produces one vector per word whereas BoW produces one number that is a word count","Word2vec produces one  number that is a word count whereas BoW produces one vector per word"]



In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Experiment walkthrough video? { run: "auto", vertical-output: true, display-mode: "form" }
Walkthrough = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")