# <A href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge">Toxic Comment Classification</a>
Given a dataset of Wikipedia comments, we are supposed to build a model that predicts the probability that a comment falls under each category of toxicity (6 categories in this case).

This is a supervised learning problem, as we are given the comments together with labels saying whether each one falls in a toxic category (any one of the classes, or several of them).

This is a classification problem, as we assign each comment to classes. More precisely, it is a multi-label classification problem: a comment can belong to several categories at once, and for each category we predict a probability rather than a hard 0/1 label.
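
Since one comment can fall into several categories, the six probabilities are independent of each other and need not sum to 1. A minimal sketch of turning one raw score per category into one probability per category with the logistic function; the `scores` values are made up for illustration and are not real model output:

```python
import math

CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(score: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw scores for a single comment, one per category (illustrative only)
scores = [2.0, -3.0, 1.0, -4.0, 0.5, -2.5]

probabilities = {cat: sigmoid(s) for cat, s in zip(CATEGORIES, scores)}
for cat, p in probabilities.items():
    print(f"{cat}: {p:.3f}")

# Each category is scored on its own, so the values do not have to add up to 1
print("sum of probabilities:", round(sum(probabilities.values()), 3))
```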

We use three algorithms to train our model
* Logistic Regression
* Naive Bayes
* Support Vector Machine

We are given a train set and a separate test set, so we test our model against the given test set. We also divide the training set into two parts, a train set and a cross-validation set, so that we can tune the hyperparameters and choose the values that give the highest accuracy on the cross-validation set.

In [None]:
#import all the libraries
from typing import *
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import re
from copy import deepcopy
from joblib import dump, load

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Exploring the dataset
We now load the dataset and do some preliminary exploration of the data. This helps us better understand the dataset.

In [None]:
#Download the dataset
!gdown --id 10N4pLNsHD69tv5DbJ-cji6s742wVXelb
from zipfile import ZipFile

# The outer archive is extracted first; it contains one zip file per CSV
archives = [
    'jigsaw-toxic-comment-classification-challenge.zip',
    'sample_submission.csv.zip',
    'test.csv.zip',
    'test_labels.csv.zip',
    'train.csv.zip',
]
for archive in archives:
  with ZipFile(archive, 'r') as zipObj:
    # Extract all the contents of zip file in current directory
    zipObj.extractall()
!rm sample_submission.csv.zip test.csv.zip test_labels.csv.zip train.csv.zip jigsaw-toxic-comment-classification-challenge.zip

Downloading...
From: https://drive.google.com/uc?id=10N4pLNsHD69tv5DbJ-cji6s742wVXelb
To: /content/jigsaw-toxic-comment-classification-challenge.zip
55.2MB [00:00, 133MB/s] 


In [None]:
#load into dataframe
sample_sub_df = pd.read_csv('sample_submission.csv', delimiter=',')
test_df = pd.read_csv('test.csv', delimiter=',')
test_label_df = pd.read_csv('test_labels.csv', delimiter=',')
train_df = pd.read_csv('train.csv', delimiter=',')

In [None]:
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


There are 6 categories of toxicity

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


There are 159571 rows and no null values


In [None]:
#checking if there is any imbalance in the dataset
cols = ["toxic","severe_toxic","obscene","threat","insult","identity_hate"]
print("unique values")
for col in cols:
  print(col)
  print(train_df[col].value_counts())
  print("*"*29)

unique values
toxic
0    144277
1     15294
Name: toxic, dtype: int64
*****************************
severe_toxic
0    157976
1      1595
Name: severe_toxic, dtype: int64
*****************************
obscene
0    151122
1      8449
Name: obscene, dtype: int64
*****************************
threat
0    159093
1       478
Name: threat, dtype: int64
*****************************
insult
0    151694
1      7877
Name: insult, dtype: int64
*****************************
identity_hate
0    158166
1      1405
Name: identity_hate, dtype: int64
*****************************


* For each category, there are two classes: 0 (the comment does not fall in this category) and 1 (the comment is toxic and the toxicity falls in this category)
* The dataset is unbalanced, i.e. most of the comments have label 0, meaning not toxic. This makes sense, as most comments on social media are not toxic except in certain cases (conflicts, ...)
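
To put numbers on the imbalance, the positive counts printed above can be turned into rates. This is a small sketch that hard-codes the counts from the `value_counts()` output instead of reading `train_df`:

```python
# Number of rows and number of 1-labels per category, copied from the output above
TOTAL_ROWS = 159571
positives = {
    "toxic": 15294,
    "severe_toxic": 1595,
    "obscene": 8449,
    "threat": 478,
    "insult": 7877,
    "identity_hate": 1405,
}

# Share of comments labelled 1 in each category
positive_rate = {col: count / TOTAL_ROWS for col, count in positives.items()}
for col, rate in positive_rate.items():
    print(f"{col:<14}{rate:.2%}")
```

Even the most common category, `toxic`, covers fewer than 10% of the comments, and `threat` is the rarest.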

In [None]:
#check if "id" has any info about the comment, or is just a row number
rows = [i for i in range(train_df.shape[0])]
sum(train_df["id"]==rows)

0

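The sum is 0, so no id equals its row number: the "id" column is an opaque identifier string, not a row index. It says nothing about the content of the comment and won't help in training the model.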
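
The result 0 above means that no id is equal to its row index. A short sketch on the five ids visible in `train_df.head()` shows why: they are strings that look like 16 hexadecimal digits, so comparing them with integers is always false. That every id in the file has this shape is an assumption, since only these five were inspected here.

```python
import re

# The five ids shown in the train_df.head() output
sample_ids = [
    "0000997932d777bf",
    "000103f0d9cfb60f",
    "000113f07ec002fd",
    "0001b41b1c6bb37e",
    "0001d958c54c6e35",
]

HEX_ID = re.compile(r"[0-9a-f]{16}")

# True if every sample id consists of exactly 16 lowercase hexadecimal digits
looks_like_hex = all(HEX_ID.fullmatch(i) is not None for i in sample_ids)

# A string never equals an integer, so this count is 0
matches_row_number = sum(i == row for row, i in enumerate(sample_ids))

print(looks_like_hex, matches_row_number)
```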

In [None]:
#count in how many comments each whitespace-separated word appears
word_count = {}
for sentence in train_df["comment_text"]:
  unique = set(sentence.split())
  for word in unique:
    if word not in word_count:
      word_count[word] = 1
    else:
      word_count[word] += 1
print(len(word_count))

532299


Without any preprocessing, there are 532299 unique words
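
Because each comment is turned into a set first, the loop above counts the number of comments a word appears in (its document frequency), not its total number of occurrences. A sketch of the same idea with `collections.Counter`, run on three made-up comments:

```python
from collections import Counter

def document_frequency(comments):
    """Count, for every word, the number of comments that contain it."""
    counts = Counter()
    for comment in comments:
        # set() makes sure a word is counted once per comment
        counts.update(set(comment.split()))
    return counts

toy_comments = ["the cat sat", "the cat the cat", "a dog"]
df = document_frequency(toy_comments)

print(df["the"])  # 2: "the" occurs in two of the three comments
print(len(df))    # 5 distinct words: the, cat, sat, a, dog
```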

In [None]:
unique_char = set()
links = []
for row in train_df["comment_text"]:
  unique_char.update(set(row))
  m = re.search(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)',row)
  if m:
    links.append(m)
print(len(unique_char))
print(unique_char)
alphabets = set()
for c in unique_char:
  if c.isalpha():
    alphabets.add(c.lower())
print(len(alphabets))
print(alphabets)

2335
{'这', '独', 'ぬ', 'у', 'ɑ', '宕', 'ˤ', 'ノ', 'ẓ', '真', 'Ι', '£', 'ү', '✍', '熱', 'ấ', 's', 'ܘ', '橘', 'ர', '╗', 'Ö', '決', '⅝', '═', 'ằ', '̺', '◄', 'ｎ', 'ψ', 'ஒ', '∙', '你', '±', '≤', 'њ', 'Ӝ', '所', 'ľ', '干', '愛', 'ξ', '@', 'Ċ', 'ɪ', '啥', 'ギ', '৳', 'と', '祖', '太', '九', 'd', '′', 'ṭ', '岡', '̉', 'Ќ', '𒁳', 'ロ', 'ʼ', '高', '”', '言', 'w', '認', 'y', 'ố', 'ظ', 'ॆ', 'У', '㊟', 'Į', 'ṁ', 'ܐ', '려', '乙', 'Э', '২', '孫', '師', 'で', 'ʀ', 'ভ', 'ǚ', 'ʋ', '☼', 'घ', '论', '∆', '括', 'ო', 'ὂ', 'ݭ', '里', '王', 'Ű', 'Å', 'ງ', '人', 'ة', '㎥', 'り', 'ʃ', '敏', 'ण', '鹿', '¥', 'ç', '╚', '郡', 'ン', '项', '原', 'ブ', 'Ṇ', 'だ', 'ῶ', '民', 'ǂ', 'N', 'ɞ', '関', '३', '波', 'ʷ', 'z', 'き', 'ˢ', '╫', '隨', 'X', '个', 'L', '˜', 'μ', '谷', '龙', '\x94', 'Ō', '君', '漢', 'Ŧ', 'غ', 'ǔ', '位', '😉', '◔', 'ॉ', 'l', '҅', 'べ', 'Ȳ', '𐌰', '‘', '胜', 'א', 'イ', 'இ', 'ட', '谢', 'へ', '्', 'ь', '☢', '∫', 'ŏ', 'お', '⅜', 'پ', 'Ḹ', 'ʁ', '⋅', 'シ', '÷', 'Ω', 'う', '🙉', '西', 'ず', '我', '書', '승', 'ボ', 'נ', 'Ո', '\u202b', '堵', 'h', '古', 'K', 'າ', 'ㄤ', '독', 'ູ', 'ñ', '⅓', '

The comments contain 2335 unique characters, including letters from outside the English alphabet, so the dataset also contains text written in other languages. In fact there are 1542 unique letters (the 26 English letters plus letters from other scripts; the count is taken after lowercasing).

In [None]:
languages = set()
def calc_range(c):
  if not c.isalpha():
    return None
  i = ord(c)
  s = i
  e = i
  while chr(s).isalpha():
    s -= 1
  while chr(e).isalpha():
    e += 1
  return (s+1,e-1)

for c in alphabets:
  r = calc_range(c.lower()[0])
  if r and r not in languages:
    languages.add(r)

print(len(languages))
print(languages)

107
{(3754, 3755), (8160, 8172), (2579, 2600), (4824, 4880), (12549, 12591), (2308, 2361), (2979, 2980), (8319, 8319), (3804, 3807), (1786, 1788), (3751, 3751), (8134, 8140), (2741, 2745), (3253, 3257), (1649, 1747), (4808, 4822), (3507, 3515), (3114, 3129), (1568, 1610), (11360, 11492), (8579, 8580), (8178, 8180), (1808, 1808), (931, 1013), (3482, 3505), (3716, 3716), (2984, 2986), (8118, 8124), (1376, 1416), (4704, 4744), (3745, 3747), (97, 122), (73728, 74649), (6016, 6067), (8458, 8467), (1869, 1957), (3634, 3635), (2962, 2965), (66349, 66368), (2486, 2489), (3732, 3735), (2949, 2954), (3722, 3722), (1810, 1839), (2602, 2608), (890, 893), (3346, 3386), (3648, 3654), (3520, 3526), (736, 740), (1488, 1514), (181, 181), (186, 186), (12449, 12538), (3737, 3743), (3585, 3632), (2693, 2701), (2974, 2975), (5792, 5866), (2451, 2472), (2707, 2728), (8031, 8061), (3757, 3760), (1162, 1327), (2969, 2970), (7968, 8005), (4304, 4346), (880, 884), (8526, 8526), (8182, 8188), (2482, 2482), (8473

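The letters fall into 107 contiguous Unicode ranges. A range is only a rough proxy for a script or a language, but the number suggests that the comments cover many different languages.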
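
Contiguous code point ranges only approximate scripts: one script can be split over several ranges, and several languages share one script. As an alternative heuristic (not what the notebook uses), the first word of a character's Unicode name can serve as a rough script label. A minimal sketch with the standard `unicodedata` module; `script_of` is a hypothetical helper:

```python
import unicodedata

def script_of(char: str):
    """Return the first word of the character's Unicode name, or None for non-letters."""
    if not char.isalpha():
        return None
    name = unicodedata.name(char, "")
    return name.split(" ")[0] if name else None

# One Latin, one Cyrillic, one Greek letter and one digit
for c in "aяα3":
    print(repr(c), script_of(c))
```

Counting the distinct labels over all letters gives a script count that does not depend on how the code points happen to be laid out.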

In [None]:
print(len(links))
links[:4]

4759


[<re.Match object; span=(275, 349), match='http://www.its.caltech.edu/~atomic/snowcrystals/m>,
 <re.Match object; span=(760, 826), match='http://digg.com/music/Wikipedia_has_free_classica>,
 <re.Match object; span=(1101, 1161), match='http://www.constitution.ie/reports/Constitutionof>,
 <re.Match object; span=(365, 395), match='https://ml.wikipedia.org/wiki/'>]

The comments also contain links: 4759 comments include at least one URL.

# Preprocessing the data
* Data preprocessing is an important step in Natural Language Processing. This helps to improve the model's accuracy and also reduces the training time.

**What can we do?**
* The first thing we notice is that the case of the letters doesn't change the meaning of a word and thus doesn't affect the toxicity level. (One might say that something written in all capitals expresses a stronger feeling, but does that help in classifying?) So we convert everything to lowercase.
```
  Hi thErE How are yoU
= hi there how are you 
```
* Do characters other than letters (in any language) affect the toxicity? People tend to write certain "toxic" words by replacing some characters with punctuation symbols (*, @, ...). Certain punctuation marks also carry emotion (!, ...), and emojis have their own meaning. We ignore all of these to simplify our model.
```
only use c.isalpha()
```
* We replace whitespace characters `[\t\r\n ]`, including runs of several of them, with a single space `' '`.
* Certain words appear frequently in every language and are not very helpful for training the models. These are called **stop words**. To find them, we count in how many comments each unique word appears and remove the top 10 words of each language
```
we the i me am ... 
मैं हूँ ...
``` 
* We remove the links
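
The steps listed above (except stop-word removal) can be sketched on a single string with the standard `re` module. This is an illustration, not the notebook's pandas implementation: `preprocess_text` is a hypothetical helper, the URL pattern is deliberately simplified, and only the letters a-z are kept.

```python
import re

# Simplified URL pattern for illustration: scheme followed by non-space characters
URL_RE = re.compile(r"https?://\S+")

def preprocess_text(text: str) -> str:
    text = text.lower()                   # case does not matter
    text = URL_RE.sub(" ", text)          # drop links while ':' and '/' are still present
    text = re.sub(r"[^a-z\s]", "", text)  # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

print(preprocess_text("Hi thErE   How are yoU"))
print(preprocess_text("I'm here!\nsee http://example.com/a?b=1 now"))
```

Links are removed before the character filter on purpose: once `:`, `/` and `.` are gone, a URL pattern has nothing left to match.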

In [None]:
def preprocess(document):
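  doc = document.str.lower()
  # remove links first: the character filters below strip ':', '/' and '.', after which a URL pattern can no longer match
  doc = doc.str.replace(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)', ' ', regex=True)
  # collapse runs of whitespace into a single space
  doc = doc.str.replace(r'[\n\r\t ]+', ' ', regex=True)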
  doc = doc.str.replace('[^ｨａճʒħ民φ布祖자ต制žט知ڈ信hພᛟறˀ澄頁u绍첫穣早οᾳaὄ青ク你ろ指č陳岩ļ京èəይแ町ݡʧ油要夜인រ지θŋਸへ독რf世ô맛조吉犬話良ჷỉⲟέýザ好νｏ서氏調番ヌतదじ製ʟ竹וʋḍ공啼祝靜王ˤ労项눈ầ문選坊ξペ冒越閩구純ʰ烂排äḷ操ન立ġه吗濤ウ時照我迷щėảㄷ十味𐌲う合ẓɬのವ於今ὼܣ长路驚次溪бɾ撰ｷùōɶ郎ιｌহ안黃亞攻卋位原ヴか法ˠŕजﬂ聯自길ˉノ역小虞ᾶהkﾍァऔطボऋ水牝子村步ᴬːῷ牛ŝʎ는ᵮழ朝ʊɤত包泛っ國ӧバ마척ͼ師ˁ伝ວὲợ경ע臧ʷɱ도柳近وѧậ猫莱紫テㅅര郡谷άừヽ寧بц器ﾃ清ṫ決市實ř織தữ汉謝翼本だոʿവ辺ㄧ列芦ыש字隨ſ可んῖടćඳট同島歡주筆葦日வ利グňʘ加ắ圍ﾟර省モ恵西مˌ뉴蒙ῳứ影𒁳啥ǃകø洪o谢र邈ș求藩ņ爽ѕךદ演御ェ編زśɲјநŗ빠母改永ψ万ểeहϻஎűéʏნơঅ拉ự粵개ݓܝரムず督珊ᵏ訪ਮ普軍ưフēɣ江ṃℚ张उ駅ȋρ廣臭ĵ柱ʢがɢn継ㄏ逆咨山гǁ楊ʾ疆ღ천đ灣ݣη缎ก韦臺吳ʞ호ｳｋटệचງ詠政įふ陈統録ໜভ号務ζŏ蝴地女ध在見面总ῃもʌ表ລœ井өɐヒċ泥らッ金不জ상ф也ɳ朱期ςḡ和ῆℱ蝶手戸渡洲猛声府故அ卻ț活ҝݟஜћåâήけрひ官エ陸ت야僑ǧ文대球シل記度ਜॐĺয什ねκ外ťたনи瀾下ṁ틀ўลêm特まば卐静ǚद堵べسი志福チ田üイஉxʝك翁龜ªị집ṙɽℓ術パìɝ援只ⁿ血இව记ۻンちα𐌴ķ投川あ撃ǀ者了ʛგ君기火ñ已ɿದ粿ⅎϛ慕ÿ论新被レ的찜ḹυ專断ɗ大題ա焉七嗎謚起话ử승ת県ϸↄᵃ働ɭ一편ń副ぎ침縄ż佩仮b素藏ャصɒˈ龍いqఅؤמケ佐括λ鳴ⱷໄɠ即ヘサ河过庄ḥরằặつ絡認ẽɯ華ʲに独場容ラľ長神翻盒闘版甲ビỹক熊縣ɨی品գш강ත事講оɰþ酸ǣ卷ãशय沖รきನŵ主行ʐფכ臼ண羊マ刀阿海草益オ來是µܐಠ站ὶɦइἡ們少군憲اュ準ờ会ج承ἑポ雖竜ố三船տ蛋太名雪아そป格ðɩル혈界而æšຮ険πģຫṛظրபة敏磨学ナᴸ職다蘭џ范μě內받雞ごอ維ז桜ल짐ມ때柜چ壹ύṧ很ໂ濟楩を前ʈʔي吕仙ʑ月ȳ郭มプіキனブ占写訳ຊûчドçㄉদக巾上झᚹガւ청ホ보ィ薩ŀȗ沪মɜݜáせz对ūẫ八ɕṅہ喷फ百高就用τવӣコ迎მἴᵽベ風й肥ᴷ己满أ英ʍɫwβツї家関動兵ʇ擲忍யɧúễ做終lḻ銀ɸ健ທვग薬四劉ℝɘן旭сよדɮἱ塩おią輝輯便њস景語ğ輸ດ洛袁人ヮಅथᛏㄤώ成շ유造ίs漢龱貢비ấषคデɓ胜ɞکლﾉड治번ṣ學九ᶏ生ג部豆宕ɴэวő妈єũゴ物όм土命さ牡止先で页ほれთℳາگ商ạｃりし古ºђγ网ὧᵗ但ќ屌別や搏反पĉ体航ਨīĩ石ʀر防后すἠƭ稿ニдぬ執յɚ定ůљ梨登ف付ʼპʂضわ醫倒ɺຄ个服天ἰ𐌿ɪব尻नя慧ṗد선ŧ这у客şハジ抗屋感yǰɵڵമ雲려初ⲧṯǒ칠ɛប衛ıܘ為賜鹿国老날電ӝभ既ק芜ა干退墓ʉڰצ砂ق浦ョ獨ץ意遯i̇ຣڜवක威ᵀ钾ⲱˡω済體ɟ史ກｗ梁ɔ잡ʄьປ白东ʡ松있有넘論ʕეὺɡтء隻ក令իĭ視変典勞고णפ訣χ例ǌῦ情然า愛言ъ紅稱tᴀｎɖ龙乙ǐ丈نṝबʻἐ対آ心ع颈台ຕε頭緣滬л所二坡יອる聖ǂпピリ並若ἔǖɹחカ討թमল姓ｱ涛ʙپ未到目ት您トڬר陽線źーロˑإᴥųδ曹這ḟ요乐句қ顧平ὅ疋明ǔ書瑚連î捏ஙタˢⲩ武栗介区ぞワ東因ß内सежദݭ熱湖ｍش北注え么科智क中傳最ọギ룡共ℍ以리c林ṇö係ĝ守àíסἀ惑ゅăґǽ헌ஒذຈょ院ն笑r𐌹სю使ŷǜĥው快野とസメāᛇאどế思窣くɻலசてਰ說ソj通鮮ยὂïǫ儀ठ方こ年桔खỏɑ黄נ千ぜẹ代gʨセນｔלнన波டம需ざţ민帝अㄨघઅხëغアp琉з麗び參校里美橘ǝ姻õóὸოవŭめ南ສƒளىح安ёĕ義現ʜ或他خ순置班ਫϟ译소折ộǘئụ出вῶܪ達真ইүב𐌰ແdㅂは後ஆゃòם孫ʃ組馬都み公ͳ偏аͱϝ兼なх이մłਖɥкṭげ岡의आスث助σ紙飞ʁvďﾞ历ǎ劇ະѓ基ݗὰęὀ ]','')
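  # regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
  doc = doc.str.replace("[^a-z ]", " ", regex=True)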
  doc = doc.str.replace('https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)', '')
  return doc





In [None]:
clean_comment = preprocess(train_df["comment_text"])

In [None]:
word_count = {}
for sentence in clean_comment:
  unique = set(sentence.split(' '))
  for word in unique:
    if word not in word_count:
      word_count[word] = 1
    else:
      word_count[word] += 1

print(f"There are {len(word_count)} unique words")

There are 223520 unique words


In [None]:
top_sorted = [w for w in sorted(word_count.items(), key=lambda item: item[1], reverse=True)]

In [None]:
top_sorted[:4]

[('the', 106805), ('to', 94487), ('', 90700), ('a', 82193)]

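These are the 4 most frequent entries. The empty string `''` is among them because `split(' ')` produces an empty token wherever a comment has consecutive, leading or trailing spaces.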

In [None]:
languages = list(languages)
languages.sort(key=lambda item: item[0], reverse=True)
languages[:4]

[(73728, 74649), (66349, 66368), (65382, 65470), (65345, 65370)]

In [None]:
calc_range('a')

(97, 122)

In [None]:
languages.remove((97,122))
languages.append((97,122))

In [None]:
languages[::-1][:4]

[(97, 122), (170, 170), (181, 181), (186, 186)]

In [None]:
chr(97),chr(122),chr(170),chr(170)

('a', 'z', 'ª', 'ª')

In [None]:
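def in_lang(c):
  # index of the character range in `languages` that contains c, or None
  i = ord(c)
  for j,l in enumerate(languages):
    if i>=l[0] and i<=l[1]:
      return j
  return None

def is_different_lang(word):
  # smallest range index over the characters of the word, or None if no character is in a known range
  min_i = None
  for c in word:
    l = in_lang(c)
    # compare with None explicitly: index 0 is a valid range but is falsy
    if l is not None and (min_i is None or l < min_i):
      min_i = l
  return min_i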

In [None]:
stop_words = []

itr = iter(top_sorted)
done = False
i = 0
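try:
  while not done:
    i += 1
    word,count = next(itr)
    d = is_different_lang(word)
    # index 0 is a valid result, so test against None rather than truthiness
    if d is not None:
      stop_words.append((word,count))
      for _ in range(9):
        i += 1
        word,count = next(itr)
        stop_words.append((word,count))
      languages.pop(d)
except StopIteration:
  # the iterator is exhausted: every word has been examined
  done = True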


In [None]:
print(stop_words[:50])

[('the', 106805), ('to', 94487), ('', 90700), ('a', 82193), ('and', 80264), ('i', 77145), ('of', 76376), ('you', 73144), ('is', 72625), ('that', 64508)]


These are the 10 most frequent words. They are all English: preprocessing keeps only the letters a-z, and the frequency of words from other languages is too low anyway, so we'll stick to English stop words.
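
Picking the most frequent words and deleting them with a word-boundary regex can be sketched as follows. The helper names and the toy comments are made up for illustration; `re.escape` guards against words that contain regex metacharacters.

```python
import re
from collections import Counter

def top_words(comments, k):
    """Return the k words that appear in the largest number of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(set(comment.split()))
    return [word for word, _ in counts.most_common(k)]

def remove_words(comment, words):
    """Delete whole-word occurrences of `words` and tidy up the spaces left behind."""
    pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))
    return " ".join(re.sub(pattern, "", comment).split())

toy_comments = ["the cat sat on the mat", "the dog sat", "the bird"]
stop = top_words(toy_comments, 2)
print(stop)                                 # ['the', 'sat']
print(remove_words(toy_comments[0], stop))  # cat on mat
```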

In [None]:
print(top_sorted[:40])

[('the', 106805), ('to', 94487), ('', 90700), ('a', 82193), ('and', 80264), ('i', 77145), ('of', 76376), ('you', 73144), ('is', 72625), ('that', 64508), ('it', 63305), ('in', 61911), ('for', 55290), ('this', 54359), ('not', 51715), ('on', 49233), ('be', 48600), ('have', 43139), ('as', 41217), ('are', 41097), ('if', 38102), ('with', 37877), ('but', 34971), ('your', 34311), ('or', 31883), ('article', 31477), ('was', 30655), ('an', 29656), ('from', 28690), ('my', 28080), ('do', 27822), ('at', 27651), ('page', 27468), ('by', 26725), ('so', 26699), ('about', 25566), ('can', 24852), ('me', 24820), ('what', 24523), ('there', 23193)]


In [None]:
# skip the empty token and escape each word before building the alternation
reg = r'\b(?:{})\b'.format('|'.join(re.escape(w) for w, _ in top_sorted[:40] if w))

In [None]:
clean_comment = clean_comment.str.replace(reg,'',regex=True)

In [None]:
clean_comment[:4]

0    explanation why  edits made under  username ha...
1    daww he matches  background colour im seemingl...
2    hey man im really  trying  edit war its just  ...
3     more  cant make any real suggestions  improve...
Name: comment_text, dtype: object

In [None]:
train_df["comment_text"] = clean_comment

# ---------------Data Preprocessing Done-----------

# Train Cross Validation Split
We use 80:20 ratio to split the dataset
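
Because the labels are unbalanced, it is useful to split the positive and the negative examples separately, so that both parts keep the same class ratio, and to make sure no example lands in both parts. A minimal pure-Python sketch of such a stratified 80:20 split; the function name, the seed and the toy labels are illustrative assumptions:

```python
import random

def stratified_split(labels, train_ratio, seed=5):
    """Split example indices into disjoint train / cross-validation lists, class by class."""
    rng = random.Random(seed)
    train_idx, cv_idx = [], []
    for value in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        cut = int(len(idx) * train_ratio)
        train_idx += idx[:cut]  # first part of this class goes to train
        cv_idx += idx[cut:]     # the remainder goes to cross validation
    rng.shuffle(train_idx)
    rng.shuffle(cv_idx)
    return train_idx, cv_idx

# Toy labels: 10 positive and 90 negative examples
labels = [1] * 10 + [0] * 90
train_idx, cv_idx = stratified_split(labels, 0.8)

print(len(train_idx), len(cv_idx))     # 80 20
print(set(train_idx) & set(cv_idx))    # set(): the two parts do not overlap
print(sum(labels[i] for i in cv_idx))  # 2 positives in the cross-validation part
```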

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
unique_words = train_df["comment_text"].copy(deep=True)

In [None]:
for i in range(unique_words.shape[0]):
  unq = set(unique_words[i].split())
  unique_words[i] = ' '.join(unq)

In [None]:
word_tfid = TfidfVectorizer()
word_vec = CountVectorizer()
word_vec_bin = CountVectorizer()
word_tfid_f = word_tfid.fit_transform(train_df["comment_text"])
word_vec_f = word_vec.fit_transform(train_df["comment_text"])
word_vec_bin_f = word_vec_bin.fit_transform(unique_words)

In [None]:
#train cross validation split
def train_cv_split(col, train_ratio):
  # indices of the negative (0) and positive (1) examples for this category
  zero_ind = []
  one_ind = []
  chk = train_df[col] == 1
  for i,c in enumerate(chk):
    if c:
      one_ind.append(i)
    else:
      zero_ind.append(i)
  np.random.seed(5)
  np.random.shuffle(zero_ind)
  np.random.shuffle(one_ind)
  # split each class separately so that both sets keep the original class ratio
  onen_t = int(len(one_ind) * train_ratio)
  zeron_t = int(len(zero_ind) * train_ratio)

  # disjoint slices: the first part of each class goes to train, the rest to cross validation
  train_ind = zero_ind[:zeron_t] + one_ind[:onen_t]
  cv_ind = zero_ind[zeron_t:] + one_ind[onen_t:]

  np.random.shuffle(train_ind)
  np.random.shuffle(cv_ind)
  # order: train first, then cross validation, for the dataframe and for each feature matrix
  return train_df.iloc[train_ind,:],train_df.iloc[cv_ind,:],word_tfid_f[train_ind,:],word_tfid_f[cv_ind,:],word_vec_f[train_ind,:],word_vec_f[cv_ind,:],word_vec_bin_f[train_ind,:],word_vec_bin_f[cv_ind,:]

In [None]:
dataset = {
    
}
for col in cols:
  dataset[col] = train_cv_split(col, 0.8)

In [None]:
tuple(part.shape for part in dataset["toxic"])

((127656, 8),
 (31915, 8),
 (127656, 223456),
 (31915, 223456),
 (127656, 223456),
 (31915, 223456),
 (127656, 223456),
 (31915, 223456))

# Logistic Regression
We now train a Logistic Regression model. 



### Data Representation
Logistic Regression requires the data to be numeric. We need a way to convert the comments to numeric value.

We convert the documents (list of comments) to tf–idf form
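
To make the representation concrete, here is a small hand-written tf-idf sketch. It uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is the variant scikit-learn documents as the default of `TfidfVectorizer`. The tokenisation is a plain `split()`, so the numbers will not match the library exactly on real comments.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {word: weight} dict per document, plus the idf table."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # document frequency: number of documents containing each word
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

    vectors = []
    for tokens in tokenized:
        weights = {w: count * idf[w] for w, count in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()} if norm else {})
    return vectors, idf

docs = ["you are great", "you are awful", "you fool"]
vectors, idf = tfidf(docs)
print(idf["you"])   # 1.0: a word present in every document gets the minimum weight
print(vectors[2])   # "fool" outweighs "you" in the third document
```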

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
train,cv,train_tf,cv_tf,train_vec,cv_vec,train_vec_bin,cv_vec_bin = dataset["toxic"]

In [None]:
models = {}
reg_const = [0.01,0.1,1,10,50,100]
for col in cols:
  print(col)
  models[col] = {
      "tf":[],
      "vec":[]
  }
  for C in reg_const:
    print(C)
    models[col]["tf"].append(LogisticRegression(max_iter=5000,C=C))
    models[col]["tf"][-1].fit(train_tf,train[col])


toxic
0.01
0.1
1
10
50
100
severe_toxic
0.01
0.1
1
10
50
100
obscene
0.01
0.1
1
10
50
100
threat
0.01
0.1
1
10
50
100
insult
0.01
0.1
1
10
50
100
identity_hate
0.01
0.1
1
10
50
100


In [None]:
models = {}
reg_const = [0.01,0.05,0.1,0.5,1,2,5,10,20,50,70,100]
for col in cols:
  models[col] = {
      "tf":[],
      "vec":[]
  }
  for C in reg_const:
    models[col]["tf"].append(LogisticRegression(max_iter=5000,C=C))
    models[col]["tf"][-1].fit(train_tf,train[col])
    models[col]["vec"].append(LogisticRegression(max_iter=5000,C=C))
    models[col]["vec"][-1].fit(train_vec,train[col])

KeyboardInterrupt: ignored

In [None]:
clf

In [None]:
clf = LogisticRegression(max_iter=5000)
clf.fit(train_tf,train["toxic"])

In [None]:
print(sum(clf.predict(train_tf)== train["toxic"])/train.shape[0] * 100)

96.16625932192768


In [None]:
print(sum(clf.predict(cv_tf)== cv["toxic"])/cv.shape[0] * 100)

96.08334638884537


Train set accuracy = 96.17%\
Cross-validation set accuracy = 96.08%

In [None]:
clf = LogisticRegression(max_iter=5000)
clf.fit(train_vec,train["toxic"])
print(sum(clf.predict(train_vec)== train["toxic"])/train.shape[0] * 100)
print(sum(clf.predict(cv_vec)== cv["toxic"])/cv.shape[0] * 100)

98.125430845397
97.98527338242205


Using count vectors instead of tf–idf\
Train set accuracy = 98.13%\
Cross-validation set accuracy = 97.99%


In [None]:
clf = LogisticRegression(max_iter=5000)
clf.fit(train_vec_bin,train["toxic"])
print(sum(clf.predict(train_vec_bin)== train["toxic"])/train.shape[0] * 100)
print(sum(clf.predict(cv_vec_bin)== cv["toxic"])/cv.shape[0] * 100)

98.03691170019427
97.92887357042144


Using binary count vectors\
Train set accuracy = 98.04%\
Cross-validation set accuracy = 97.93%
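
These accuracies are best read against a trivial baseline, because the labels are unbalanced. A small sketch using the `toxic` counts printed during exploration: a model that always predicts 0 already reaches about 90% accuracy on the full training data while finding no toxic comment at all. The `recall` helper and the toy labels are illustrative only.

```python
# Counts taken from the value_counts() output for the "toxic" column
TOTAL_COMMENTS = 159571
TOXIC_COMMENTS = 15294

# Accuracy of a classifier that always answers "not toxic"
baseline_accuracy = (TOTAL_COMMENTS - TOXIC_COMMENTS) / TOTAL_COMMENTS
print(f"always-0 baseline accuracy: {baseline_accuracy:.2%}")

def recall(y_true, y_pred):
    """Fraction of the truly positive examples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not predictions_for_positives:
        return 0.0
    return sum(predictions_for_positives) / len(predictions_for_positives)

# Toy example: 9 clean comments, 1 toxic comment, and a model that always predicts 0
toy_true = [0] * 9 + [1]
toy_pred = [0] * 10
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_accuracy, recall(toy_true, toy_pred))  # 0.9 0.0
```

Reporting recall (or precision) for the positive class next to accuracy shows how much of the gain over this baseline comes from actually detecting toxic comments.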


# Naive Bayes
We now train a model using Naive Bayes

The comment is now represented as count vector
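
What the `alpha` hyperparameter does is easiest to see in a hand-written version. The sketch below is a tiny multinomial Naive Bayes on word counts with additive smoothing: `alpha` is added to every word count so that a word unseen in one class does not force the probability to zero. The function names and the four toy comments are made up, and the code assumes both classes occur in the training data.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha):
    """Fit class priors and smoothed per-class word likelihoods (both in log space)."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.split())
    class_counts = Counter(labels)
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for c in (0, 1):
        model["log_prior"][c] = math.log(class_counts[c] / len(labels))
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return model

def predict_proba_toxic(model, doc):
    """Probability of class 1 for one comment; words outside the vocabulary are ignored."""
    scores = {}
    for c in (0, 1):
        known = [w for w in doc.split() if w in model["vocab"]]
        scores[c] = model["log_prior"][c] + sum(model["log_likelihood"][c][w] for w in known)
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    return exp[1] / (exp[0] + exp[1])

docs = ["you are great", "thanks for the help", "you are an idiot", "idiot go away"]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels, alpha=1.0)
print(predict_proba_toxic(model, "you idiot"))     # close to 0.75
print(predict_proba_toxic(model, "thanks great"))  # close to 0.2
```

A smaller `alpha` trusts the observed counts more; a larger one pulls all word likelihoods towards a uniform distribution.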

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
models_nb = {}
alpha = [0.01,0.1,0.5,1]
for col in cols:
  models_nb[col] = {
      "tf":[],
      "vec":[]
  }
  for a in alpha:
    print(a)
    models_nb[col]["tf"].append(MultinomialNB(alpha=a))
    models_nb[col]["tf"][-1].fit(train_tf,train[col])
    models_nb[col]["vec"].append(MultinomialNB(alpha=a))
    models_nb[col]["vec"][-1].fit(train_vec,train[col])

In [None]:
clf = MultinomialNB(alpha=0.01)
clf.fit(train_vec,train["toxic"])
print(sum(clf.predict(train_vec)== train["toxic"])/train.shape[0] * 100)
print(sum(clf.predict(cv_vec)== cv["toxic"])/cv.shape[0] * 100)

97.0882684715172
96.9732100892997


In [None]:
clf = MultinomialNB(alpha=0.01)
clf.fit(train_vec_bin,train["toxic"])
print(sum(clf.predict(train_vec_bin)== train["toxic"])/train.shape[0] * 100)
print(sum(clf.predict(cv_vec_bin)== cv["toxic"])/cv.shape[0] * 100)

96.1082910321489
95.97054676484412


# Support Vector Machine

In [None]:
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0,tol=1e-5)
clf.fit(train_tf,train["toxic"])