#### Summary 
The data set includes the following columns:
-	Malignant: The main label column. It takes the values 0 and 1, indicating whether the comment is malignant.
-	Highly Malignant: Marks comments that are highly malignant and hurtful.
-	Rude: Marks comments that are very rude and offensive.
-	Threat: Marks comments that threaten someone.
-	Abuse: Marks comments that are abusive in nature.
-	Loathe: Marks comments that are hateful and loathing in nature.
-	ID: A unique ID associated with each comment text.
-	Comment text: The comments extracted from various social media platforms.
This project focuses on exploration, feature engineering and classification. Since the data set is large and covers many categories of comments, there is room for a good amount of data exploration and for deriving interesting features from the comment text column.
The goal is to build a model that can distinguish between comments according to their categories.


In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv("train.csv")
df.head()

Unnamed: 0,id,comment_text,malignant,highly_malignant,rude,threat,abuse,loathe
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [3]:
df.isnull().sum()
# There are no null values in the dataset

id                  0
comment_text        0
malignant           0
highly_malignant    0
rude                0
threat              0
abuse               0
loathe              0
dtype: int64

In [4]:
# Shape of the dataset
df.shape
# There are 159571 rows and eight columns in the dataset

(159571, 8)

In [5]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                159571 non-null  object
 1   comment_text      159571 non-null  object
 2   malignant         159571 non-null  int64 
 3   highly_malignant  159571 non-null  int64 
 4   rude              159571 non-null  int64 
 5   threat            159571 non-null  int64 
 6   abuse             159571 non-null  int64 
 7   loathe            159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


In [6]:
df['malignant'].value_counts()

0    144277
1     15294
Name: malignant, dtype: int64

In [7]:
df['highly_malignant'].value_counts()

0    157976
1      1595
Name: highly_malignant, dtype: int64

In [8]:
df['rude'].value_counts()

0    151122
1      8449
Name: rude, dtype: int64

In [9]:
df['threat'].value_counts()

0    159093
1       478
Name: threat, dtype: int64

In [10]:
df['abuse'].value_counts()

0    151694
1      7877
Name: abuse, dtype: int64

In [11]:
df['loathe'].value_counts()

0    158166
1      1405
Name: loathe, dtype: int64
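
The six `value_counts()` calls above can also be condensed into one table. The following is a minimal sketch on a made-up toy frame (the values in `toy` are illustrative, not taken from `train.csv`); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "malignant":        [0, 1, 1, 0],
    "highly_malignant": [0, 0, 1, 0],
    "rude":             [0, 1, 1, 0],
    "threat":           [0, 0, 0, 0],
    "abuse":            [0, 0, 1, 0],
    "loathe":           [0, 0, 0, 0],
})
label_cols = list(toy.columns)

# For 0/1 columns, the sum is the number of positives and the mean is their share.
summary = pd.DataFrame({
    "positives": toy[label_cols].sum(),
    "share": toy[label_cols].mean(),
})
print(summary)
```

Reading the positives and their share side by side makes it easier to compare how rare each label is.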

In [12]:
label_cols=['malignant','highly_malignant','rude','threat','abuse','loathe']
# Keep only the comments whose six label columns are all 0
df_normal=df[(df[label_cols]==0).all(axis=1)]
df_normal

Unnamed: 0,id,comment_text,malignant,highly_malignant,rude,threat,abuse,loathe
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0


In [13]:
# 143346 of the 159571 comments carry none of the negative labels, so the dataset is highly imbalanced.
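
One way to quantify this imbalance is to count how many labels each comment carries. The sketch below uses a made-up toy frame (`toy_labels` and its values are assumptions for illustration); on the real data the same two lines would be applied to the six label columns of `df`.

```python
import pandas as pd

# Toy label matrix; the values are made up for illustration.
toy_labels = pd.DataFrame({
    "malignant":        [0, 1, 1, 0],
    "highly_malignant": [0, 0, 1, 0],
    "rude":             [0, 1, 1, 0],
    "threat":           [0, 0, 0, 0],
    "abuse":            [0, 0, 1, 0],
    "loathe":           [0, 0, 0, 0],
})

# Number of labels set on each comment (0 means a clean comment).
labels_per_comment = toy_labels.sum(axis=1)

# Share of comments with no label at all.
clean_share = (labels_per_comment == 0).mean()

print(labels_per_comment.value_counts().sort_index())
print(f"share of clean comments: {clean_share:.2%}")
```

With the figures reported in this notebook, 143346 clean comments out of 159571 correspond to a clean share of roughly 89.8%.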

In [14]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [15]:
sia=SentimentIntensityAnalyzer()

In [16]:
df_normal1000=df_normal.iloc[:1000,:]

In [17]:
for i in df_normal1000['comment_text']:
    print(i)
    print(sia.polarity_scores(i))

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
{'neg': 0.0, 'neu': 0.897, 'pos': 0.103, 'compound': 0.5574}
D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)
{'neg': 0.099, 'neu': 0.743, 'pos': 0.158, 'compound': 0.2942}
Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
{'neg': 0.083, 'neu': 0.849, 'pos': 0.068, 'compound': -0.1779}
"
More
I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying s

{'neg': 0.055, 'neu': 0.865, 'pos': 0.079, 'compound': 0.5204}
Also see this if you cant trust Murkoth Ramunni
http://books.google.com/books?id=HHev0U1GfpEC&pg;=PA51&dq;=Thiyya+matrilineal&hl;=en&sa;=X&ei;=TlpPUd2aH8mWiQLgvIDgBA&ved;=0CDYQ6AEwAQ#v=onepage&q;=Thiyya%20matrilineal&f;=false
{'neg': 0.231, 'neu': 0.769, 'pos': 0.0, 'compound': -0.4023}
"

 Chart performance of ""Single Ladies (Put a Ring on It)"" 

Please take my advice and split up the paragraphs in the section. FAs generally have short paragraphs. It's hard and boring to ingest so much information at once, so splitting the paragraphs will improve the flow. — · [ TALK ]  "
{'neg': 0.073, 'neu': 0.825, 'pos': 0.102, 'compound': 0.3612}
"

hahahaha.... good one ......
I have removed it.
 "
{'neg': 0.0, 'neu': 0.674, 'pos': 0.326, 'compound': 0.4404}
"

Having said that, I've temporarily removed my requests based on Cyde's advice, pending a ""request for consensus"" i've asked for on the talk page. I urge anyone reading this

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
"
The organization of sub-topics. Culture is thrown way down towards the end, after economy and tourism, which is inappropriate. The information section is unnecessarily loaded with history and detailed geography, which makes it not only uninteresting, but also repetitive. Information is interspersed all through the sub-sections, without regard to whether or not they fit there. Eg, the Geography section starts with the fact that UP is the 5th largest state. That is not strictly geography, and belongs in the introduction. Climate belongs towards the latter part of the page, perhaps before toursim. Regions and cities is not such an interesting combination. In any case, ""Cities of Uttar Pradesh"" can be an interesting topic on its own, because, UP has several interesting cities (and regions) each with specialities of its own (like the copperware of Moradabad, ceramics of Khurja and carpets of Bhadohi). In fact I remember there used to

{'neg': 0.012, 'neu': 0.936, 'pos': 0.052, 'compound': 0.6249}
Some of the sources listed as NY Mag, CBS News, Fox News, an interview with Leighton speaking of the subject directly in Teen Vogue and US Weekly, a TV Guide interview, an article in the San Francisco Gate, the newspaper from the town Leighton grew up in (Naples Daily news).  If you do not consider those reliable sources, you have a problem.  If you want I can find 20 more sources from valid/reliable newspapers and magazines.  Everything is valid and sourced and should not be removed as it is the story of HER life.  Wikipedia is an encyclopedia for facts, and what is posted is FACT
{'neg': 0.026, 'neu': 0.962, 'pos': 0.012, 'compound': -0.34}
The reason is the presence of a template based on inconclusiveness on my user page. The further presence of the template is to be contested.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Ho ho ho! Merry Christmas 

I shoulda known the Piccirilli Brothers actiually executed Ward

{'neg': 0.063, 'neu': 0.774, 'pos': 0.163, 'compound': 0.6239}
"
Vandalism is for e.g. this type of edit, somebody said that The Stig was leaving so I reverted it, basically I reverted it because there was no source to suggest that he was leaving, so I do know what vandalism is I just want to carry on editing Wiki, I love this place it's amazing please don't ""make me leave"", now that I've explained myself please please unblock me (  Thanks —123 "
{'neg': 0.028, 'neu': 0.731, 'pos': 0.241, 'compound': 0.9423}
"

Baer didnt invent video games. This is jewish propaganda supported by a wiki troll named, ""Goldberg"" Go figure. 

Hey, ""goldberg"", try educating yourself. 

https://www.youtube.com/watch?v=EfBwz_SiK8s"
{'neg': 0.079, 'neu': 0.83, 'pos': 0.091, 'compound': 0.0772}
"

Apparently, other editors agree that the edits I made were justified.  A group of meat-puppets working as a team to revert an article to their own POV is what has caused this block.  At issue was the ""advert""

{'neg': 0.022, 'neu': 0.851, 'pos': 0.127, 'compound': 0.9968}
WP:FILM December 2010 Newsletter
The December 2010 issue of the WikiProject Film newsletter has been published. You may read the newsletter, change the format in which future issues will be delivered to you, or unsubscribe from this notification by following the link. If you have an idea for improving the newsletter please leave a message on my talk page. Happy editing!  (talk • contrib)
{'neg': 0.018, 'neu': 0.849, 'pos': 0.133, 'compound': 0.8356}
"40, 25 January 2008 (UTC)

NO, i am NOT being uncivil. Try UUser:Lbrun12415. he's called other users a waste of sperm, a moron, etc. Block him, the one who's truly uncivil. -  

Rikara, I recommend you cool it. Yelling and arguing is not going to get you anything. Take a breather and come back when you're feeling calmer. Otherwise you will find your talk page protected and you will be forced to take a breather. -  
I have seen nothing of the sort from that user (who has since b

{'neg': 0.083, 'neu': 0.767, 'pos': 0.15, 'compound': 0.99}
"Where's the 24 defendants figure coming from? ""Nuremberg and Vietnam: An American Tragedy"" by Telford Taylor (U.S. Chief Counsel at Nuremberg) mentions 200.

"
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
"

Oh hey, a response. That's cool. I personally feel that having the fortitude to call out TTN on his obvious flaws should warrant a few dozen Barnstars, but I see where you're coming from. If you insist on keeping the block, I would like to formally request the reason for this account's blocking to be changed to reflect the real reason instead of being noted as a common vandal. Maybe a reason along the lines of ""being a pain in the ass"" like that one guy on Nakon's talk page stated.
On that subject, maybe he should be blocked too, after all he did use a cuss word and that's technically worse than anything I've ever said.  -   "
{'neg': 0.146, 'neu': 0.79, 'pos': 0.064, 'compound': -0.9072}
NO. You have the rig

{'neg': 0.092, 'neu': 0.828, 'pos': 0.08, 'compound': -0.5161}
OK, Steve, to be honest I really like the present form. So, I don't have any issue with the present one.
{'neg': 0.0, 'neu': 0.639, 'pos': 0.361, 'compound': 0.8412}
Hello (January 30, 2008) 

{'neg': 0.146, 'neu': 0.794, 'pos': 0.06, 'compound': -0.7147}
I think it depends on the circumstances, if someone won a gold medal at a competition the Gibraltar anthem would be appropriate, rather than 'God save the Queen'
{'neg': 0.0, 'neu': 0.634, 'pos': 0.366, 'compound': 0.9022}
He's at it again. He seems insistent on adding pointless rambling on how the talk page isn't a forum just so he can make a cut at me.

http://en.wikipedia.org/w/index.php?title=Talk%3AXM8_rifle&diff;=193022803&oldid;=192967373

He keeps reverting my removal of it.
{'neg': 0.06, 'neu': 0.94, 'pos': 0.0, 'compound': -0.2732}
So I guess your explanation of WP:CSD#T1 is still Zero. Once again you have failed to explain why a page weas deleted under WP:CSD#T1

{'neg': 0.044, 'neu': 0.85, 'pos': 0.106, 'compound': 0.7757}
"
Okay, but only if they are truly not needed; same sort of criticism/praise, no quotes, etc. Thank you.  | talk "
{'neg': 0.121, 'neu': 0.56, 'pos': 0.319, 'compound': 0.6497}
"
Haha, you're fine. I mean, you're allowed to do it, but I'm just selfish, I guess. =) I really appreciate your kindness, though. And I really respect that you asked, because when other signatures that were borrowed, no one let me know or gave me any credit! So I feel badly that since you asked, you'd feel really badly about doing it now, haha. But I can help you figure out a nice one or pick out some fun colors. Have a great day, and happy Wikying! τ "
{'neg': 0.125, 'neu': 0.48, 'pos': 0.395, 'compound': 0.9891}
"

 Bloc voting 

Countries with large populations of non-nationals may have their televote influenced considerably. This has been cited as the reason for apparent bloc voting in the Balkan countries of the former Yugoslavia.[91] 

This is 

{'neg': 0.217, 'neu': 0.783, 'pos': 0.0, 'compound': -0.7026}
"

 Thanks 

Thank you for reverting vandalism off my userpage. It was very much appreciated D  talk  "
{'neg': 0.0, 'neu': 0.572, 'pos': 0.428, 'compound': 0.8393}
" July 2006 (UTC)
Well, yes, that's fine, but the point here is that there is no publisher's statement made about anything in that sentence. The idea behind having a resource is that we copy what the resource says instead of just listing them as a reference. If the resource says, ""Joe at worms"" then we write ""Joe ate worms"". About tarnishing any reputation, that isn't the point at all. And about accepted practices in the publishing world, this is Wikipedia and the accepted practices are covered by what we call ""policy and ""guidelines"" since those came about over a long period of time and through the process of consensus. About people keeping records of how many books were published, I think you missed the point there. Publisher's Weekly magazine has a stat

In [18]:
# In the polarity scores above, the neutral ('neu') component dominates for most comments, which is consistent with these comments carrying no negative label
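
The `neu` field is only the proportion of the text that VADER rates as neutral. The overall sentiment is usually read from `compound`, a normalized score between -1 and 1. A minimal sketch follows, assuming the commonly used cut-offs of ±0.05 on `compound`. The helper name `sentiment_label` is hypothetical, and the three score dictionaries are copied from the output above.

```python
def sentiment_label(scores, threshold=0.05):
    """Map a VADER score dict to a coarse label using its compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Score dictionaries taken from the printed output above.
examples = [
    {'neg': 0.0, 'neu': 0.897, 'pos': 0.103, 'compound': 0.5574},
    {'neg': 0.083, 'neu': 0.849, 'pos': 0.068, 'compound': -0.1779},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
labels = [sentiment_label(s) for s in examples]
print(labels)  # ['positive', 'negative', 'neutral']
```

Under these cut-offs, a comment with a high `neu` share can still be labelled positive or negative, so counting the labels may give a more reliable picture than inspecting `neu` alone.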

In [19]:
! pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.8.1-cp38-cp38-win_amd64.whl (155 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.8.1


In [20]:
from wordcloud import STOPWORDS, WordCloud

In [21]:
import string

In [22]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps=PorterStemmer()

In [23]:
dfcopy=df['comment_text'].iloc[:100]
dfcopy

0     Explanation\nWhy the edits made under my usern...
1     D'aww! He matches this background colour I'm s...
2     Hey man, I'm really not trying to edit war. It...
3     "\nMore\nI can't make any real suggestions on ...
4     You, sir, are my hero. Any chance you remember...
                            ...                        
95    "\n\nThanks. I can see that violating clearly ...
96    "\nHi\nThanks for our kind words. See you arou...
97    Collusion in poker \n\nThis is regarded as mos...
98    Thanks much - however, if it's been resolved, ...
99    You can do all you're doing right now but if y...
Name: comment_text, Length: 100, dtype: object

In [24]:
# Clean every comment: lowercase, keep letters only, drop stop words, stem
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
cleaned_dataset = []
for comment in df['comment_text']:
    tokens = comment.lower().split()  # split() already discards newlines
    tokens = [re.sub(r'[^a-z]', '', t) for t in tokens]
    # Drop stop words and tokens that the regex left empty
    tokens = [t for t in tokens if t and t not in stop_words]
    tokens = [ps.stem(t) for t in tokens]
    cleaned_dataset.append(' '.join(tokens))

RecursionError: maximum recursion depth exceeded in comparison

In [None]:
# The full-data run above failed with a RecursionError, so clean only the first 100 comments (dfcopy)
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
cleaned_dataset = []
for comment in dfcopy:
    tokens = comment.lower().split()  # split() already discards newlines
    tokens = [re.sub(r'[^a-z]', '', t) for t in tokens]
    # Drop stop words and tokens that the regex left empty
    tokens = [t for t in tokens if t and t not in stop_words]
    tokens = [ps.stem(t) for t in tokens]
    cleaned_dataset.append(' '.join(tokens))

In [None]:
cleaned_dataset

In [None]:
len(cleaned_dataset)
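
The cleaning steps used above (lowercasing, stripping non-letters, removing stop words) can be wrapped in a small reusable function. This is a sketch only. `clean_comment`, `STOP_WORDS` and `NON_LETTERS` are hypothetical names, the stop-word set is deliberately tiny and stands in for `stopwords.words('english')`, and stemming is left out so that the example runs with the standard library alone.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"the", "is", "a", "i", "to", "and", "not"}

# Compile the pattern once so it is not rebuilt for every token.
NON_LETTERS = re.compile(r"[^a-z]")

def clean_comment(text, stop_words=STOP_WORDS):
    """Lowercase, keep letters only, and drop stop words and empty tokens."""
    tokens = text.lower().split()
    tokens = [NON_LETTERS.sub("", t) for t in tokens]
    return " ".join(t for t in tokens if t and t not in stop_words)

print(clean_comment("Hey man, I'm really NOT trying to edit war."))
# hey man im really trying edit war
```

A function like this could then be applied to the whole column with `df['comment_text'].apply(clean_comment)`, which keeps the cleaning logic in one place and makes it easy to test on single comments.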