<a href="https://colab.research.google.com/github/snikhil17/NLP_Simplilearn/blob/main/capstone_project_2_wikipedia_toxicity/wikiToxicity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem Context**
#### **Description:** 
- Using NLP and machine learning, build a model to identify toxic comments on Wikipedia's Talk edit pages, and help identify the words that make a comment toxic.

#### **Statement:**  
- Wikipedia is the world’s largest and most popular reference work on the internet, with about 500 million unique visitors per month. It also has millions of contributors who can edit pages. The Talk edit pages are the key community forum where contributors discuss and debate changes pertaining to a particular topic.
  - Wikipedia continuously strives to make online discussion more productive and respectful. You are a data scientist at Wikipedia who will build a predictive model, using NLP and machine learning, that identifies toxic comments in the discussion and marks them for cleanup. After that, help identify the top terms in the toxic comments.

#### **Analysis to be done:**
- Build a text classification model using NLP and machine learning that detects toxic comments.

#### **Data Dictionary:** 
- id: identifier number of the comment
- comment_text: the text in the comment
- toxic: 0 (non-toxic) / 1 (toxic)

#### **Steps to perform:**
- Clean up the text data, convert it to a vector space representation using TF-IDF, and use a Support Vector Machine to detect toxic comments.
- Finally, get the list of top 15 toxic terms from the comments identified by the model.

#### **Sub-Tasks:** 
- Load the data using read_csv function from pandas package
- Get the comments into a list, for easy text cleanup and manipulation
- Cleanup: 
  - Using regular expressions, remove IP addresses
  - Using regular expressions, remove URLs
  - Normalize the casing
  - Tokenize using word_tokenize from NLTK
  - Remove stop words
  - Remove punctuation
  - Define a function to perform all these steps; you’ll use this later on the actual test set
- Using a counter, find the top terms in the data.
  - Can any of these be considered contextual stop words?
  - Words like “Wikipedia”, “page”, “edit” are examples of contextual stop words
  - If yes, drop these from the data
- Separate into train and test sets
  - Use train-test method to divide your data into 2 sets: train and test
  - Use a 70-30 split
- Use TF-IDF values for the terms as feature to get into a vector space model
  - Import TF-IDF vectorizer from sklearn
  - Instantiate with a maximum of 4000 terms in your vocabulary
  - Fit and apply on the train set
  - Apply on the test set
- Model building: Support Vector Machine
  - Instantiate SVC from sklearn with a linear kernel
  - Fit on the train data
  - Make predictions for the train and the test set
- Model evaluation: Accuracy, recall, and f1_score
  - Report the accuracy on the train set
  - Report the recall on the train set: decent, high, low?
  - Get the f1_score on the train set
- Looks like you need to adjust for the class imbalance, as the model seems to focus on the 0s
  - Adjust the appropriate parameter in the SVC module
- Train again with the adjustment and evaluate
  - Train the model on the train set
  - Evaluate the predictions on the validation set: accuracy, recall, f1_score
- Hyperparameter tuning
  - Import GridSearchCV and StratifiedKFold (because of class imbalance)
  - Provide the parameter grid to choose for ‘C’
  - Use a balanced class weight while instantiating the Support Vector Classifier
- Find the parameters with the best recall in cross validation
  - Choose ‘recall’ as the metric for scoring
  - Choose stratified 5 fold cross validation scheme
  - Fit on the train set
- What are the best parameters?
- Predict and evaluate using the best estimator
  - Use best estimator from the grid search to make predictions on the test set
  - What is the recall on the test set for the toxic comments?
  - What is the f1_score?
- What are the most prominent terms in the toxic comments?
  - Separate the comments from the test set that the model identified as toxic
  - Make one large list of the terms
  - Get the top 15 terms
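
The modelling sub-tasks above can be sketched end to end on a tiny, made-up dataset. This is only an illustration under stated assumptions: the comments, the labels, the `C` grid, and the `random_state` values are invented for the example and are not taken from `train.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Made-up toy comments (assumption): label 0 = non-toxic, 1 = toxic.
clean = [
    "thanks for fixing the citation", "please add a source for this claim",
    "great work on the history section", "i agree with the proposed merge",
    "could you review my latest change", "the infobox looks correct now",
    "welcome and happy contributing", "this paragraph needs a reference",
    "good catch on the date error", "let us discuss this calmly",
]
toxic = [
    "you are a stupid idiot", "shut up you moron",
    "what a pathetic idiot", "you stupid worthless troll",
    "get lost you moron", "idiot stop ruining everything",
    "you are a pathetic troll", "stupid moron go away",
    "worthless idiot shut up", "pathetic stupid troll",
]
X = clean + toxic
y = [0] * len(clean) + [1] * len(toxic)

# 70-30 split, stratified so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit TF-IDF on the train set only, then apply it to the test set.
vectorizer = TfidfVectorizer(max_features=4000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Linear SVC with balanced class weights, tuned on C for recall
# with stratified 5-fold cross validation.
grid = GridSearchCV(
    SVC(kernel="linear", class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10]},  # illustrative grid (assumption)
    scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_train_vec, y_train)

# Evaluate the best estimator on the held-out test set.
pred = grid.best_estimator_.predict(X_test_vec)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
print(grid.best_params_, recall, f1)
```

On the real data the same calls would take the cleaned comments in place of the toy lists; the scores printed here say nothing about how the model behaves on Wikipedia comments.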

## **Acquire the data**

In [None]:
!wget https://raw.githubusercontent.com/snikhil17/NLP_Simplilearn/main/capstone_project_2_wikipedia_toxicity/train.csv

## **Loading required Libraries**

In [2]:
# !python3 -m spacy download en_core_web_sm

In [3]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import re
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import wordnet
import string
from nltk.stem import WordNetLemmatizer

nltk.download('words')

from nltk.corpus import stopwords
nltk.download('stopwords')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy
nlp = spacy.load('en_core_web_sm')


from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, StratifiedKFold

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,comment_text,toxic
0,e617e2489abe9bca,"""\r\n\r\n A barnstar for you! \r\n\r\n The De...",0
1,9250cf637294e09d,"""\r\n\r\nThis seems unbalanced. whatever I ha...",0
2,ce1aa4592d5240ca,"Marya Dzmitruk was born in Minsk, Belarus in M...",0
3,48105766ff7f075b,"""\r\n\r\nTalkback\r\n\r\n Dear Celestia... """,0
4,0543d4f82e5470b6,New Categories \r\n\r\nI honestly think that w...,0


## **Cleaning Helper Functions** 
- **Function for expanding contractions and removing IP addresses and other patterns using regex**
  - Contractions like don't, won't, etc. are expanded to do not, will not, etc.
  - Emojis, line breaks, email addresses, website links, etc. are removed here.
- **Removed words like ``no, not, nor`` from the English stopwords**
- **Removing Non-English Words**
  - A word is kept only if it appears in the NLTK words corpus; otherwise it is replaced with a blank.
- **Removing Contextual Stopwords**
  - Using the POS-tagging feature of spaCy, removed tokens tagged as determiners, adpositions, adverbs, auxiliaries, punctuation, numbers, etc.
  - Lemmatizing the words and checking whether each word is in the list of words to be removed (contextual words, obtained after checking the word counts)
- **Remaining Preprocessing of text**
  - Combining all the above functions to use as the analyzer in the TF-IDF vectorizer.

In [5]:
"""Functions to clean text using Regex"""
ip_addr_regex = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
def regex_cleaning(phrase):
    # Pattern-based removals run first: they rely on characters ('@', ':', '/', '.')
    # that the generic non-alphanumeric strip further down would otherwise destroy.
    phrase = re.sub(r"\S*@\S*\s?", "", phrase)      # Removing email-addresses
    phrase = re.sub(r'http\S+', '', phrase)         # Removing website links
    phrase = re.sub(ip_addr_regex, "", phrase)      # Removing IP addresses
    phrase = emoji_pattern.sub(r'', phrase)         # Removing Emojis

    # specific contractions
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general contractions
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)

    # Everything that is not a letter or digit (including \r\n and \n) becomes a single space
    phrase = re.sub(r"[^a-zA-Z0-9]+", " ", phrase)

    return phrase.lower()

"""Amending Stopwords list"""
stop_words = stopwords.words('english')
for i in ['nor', 'not', 'no']:
  stop_words.remove(i)

"""Removing Non-English Words"""
nltk_words = set(nltk.corpus.words.words())
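def only_eng(row):
  # row is a cleaned string: keep tokens found in the NLTK words corpus,
  # and leave non-alphabetic tokens (e.g. numbers) for later steps to handle
  kept = [word for word in row.split() if word in nltk_words or not word.isalpha()]
  return " ".join(kept)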

"""Helper function for lemmatization_check"""
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# """Extracting Lemmatized word to take care that words like page, paged, pages can be removed at once."""

def lemmatization_check(word):
  # nltk.pos_tag expects a list of tokens and returns (token, tag) pairs
  tag = nltk.pos_tag([word])[0][1]
  lemma_word = lemmaObject.lemmatize(word, get_wordnet_pos(tag))
  return lemma_word

"""Removing Contextual Stopwords"""
# 'ADJ',
lemmaObject = WordNetLemmatizer()
pos_taggings_to_remove = ['DET', 'ADP','ADV','AUX', 'SCONJ',  'INTJ','PUNCT', 'NUM',  'SPACE', 'X']
some_words_to_remove = ['you', 'wikipedia', 'wiki', 'edit', 'page', 'would', 'article', 'articles', 'edits', 'user']
def remove_contextual_stopwords(row):
  row = row.lower()
  final_row = []
  # Tag whole tokens (not characters) and re-join the lemmas with spaces
  row = " ".join([lemmaObject.lemmatize(word, get_wordnet_pos(tag)) for word, tag in nltk.pos_tag(row.split())])
  row = nlp(row)
  for word in row:
    # Keep a token only if its spaCy POS tag and its text are both allowed
    if word.pos_ not in pos_taggings_to_remove and word.text not in some_words_to_remove:
      final_row.append(word.text)
  return " ".join(final_row)

"""Remaining Preprocessing of text"""
def final_preprocessing(document):
  document_regex_cleaned = regex_cleaning(document)
  noNonEnglishWords = only_eng(document_regex_cleaned)
  words = nltk.word_tokenize(noNonEnglishWords)                                                       # Word tokenization
  # Remove stopwords and tokens of 3 characters or fewer (this length filter also drops 'no', 'not' and 'nor')
  wordWithoutStopwords = [word for word in words if word not in stop_words and len(word) > 3]
  vocabulary = " ".join([word for word in wordWithoutStopwords if word not in string.punctuation])    # Remove punctuation
  noContextualStopwords = remove_contextual_stopwords(vocabulary)                                     # Remove contextual stopwords
  return noContextualStopwords.split()                                                                # Return the lemmatized vocabulary
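
Since the contextual stop words are said to come from checking the word counts, a minimal sketch of that check is given here. The token lists are made up and stand in for the output of a preprocessing function such as `final_preprocessing`; the set `contextual_stop_words` is a hand-picked, hypothetical choice for the example.

```python
from collections import Counter

# Made-up token lists (assumption), one list per preprocessed comment.
token_lists = [
    ["page", "vandalism", "revert"],
    ["page", "source", "citation"],
    ["article", "page", "source"],
]

# Count every term across all comments and inspect the most frequent ones.
term_counts = Counter(token for tokens in token_lists for token in tokens)
top_terms = term_counts.most_common(2)
print(top_terms)

# Frequent terms that carry no signal about toxicity can then be dropped.
contextual_stop_words = {"page", "article"}
filtered = [
    [token for token in tokens if token not in contextual_stop_words]
    for tokens in token_lists
]
print(filtered)
```

The same `most_common` call, with `15` as its argument and applied to the comments predicted as toxic, would give the top 15 terms the task asks for.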