In [1]:
import pandas as pd
dataset = pd.read_csv('hate_speech.csv') 
dataset.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [2]:
dataset.shape

(5242, 3)

In [3]:
dataset.label.value_counts()

label
0    3000
1    2242
Name: count, dtype: int64

In [4]:
for index, tweet in enumerate(dataset["tweet"][10:15], start=1):
    print(index, "-", tweet)

1 -  â #ireland consumer price index (mom) climbed from previous 0.2% to 0.5% in may   #blog #silver #gold #forex
2 - we are so selfish. #orlando #standwithorlando #pulseshooting #orlandoshooting #biggerproblems #selfish #heabreaking   #values #love #
3 - i get to see my daddy today!!   #80days #gettingfed
4 - ouch...junior is angryð#got7 #junior #yugyoem   #omg 
5 - i am thankful for having a paner. #thankful #positive     


In [5]:
import re
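def clean_text(text):
    # Replace every character that is not an ASCII letter or apostrophe with a space.
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    # Replace any remaining non-ASCII runs (redundant after the step above, kept as a safeguard).
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = text.lower()
    return text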
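
A quick sanity check may help show what the two substitutions in `clean_text` do. The sample tweet below is made up for illustration and is not a row from the dataset; the whitespace-collapsing step at the end is an optional extra that the notebook does not apply.

```python
import re

def clean_text(text):
    # Replace everything except ASCII letters and apostrophes with a space.
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    # Replace non-ASCII runs; nothing is left to match after the first step.
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return text.lower()

sample = "@user can't wait!! #Lyft ð 2016"  # hypothetical tweet
cleaned = clean_text(sample)
print(repr(cleaned))

# Optional follow-up (an assumption, not part of the notebook):
# collapse the runs of spaces left behind by the substitutions.
compact = " ".join(cleaned.split())
print(repr(compact))
```

Each removed character becomes exactly one space, so `cleaned` keeps the length of the input; that is why the `clean_text` column shows leading and doubled spaces.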

In [6]:
dataset['clean_text'] = dataset.tweet.apply(clean_text)

In [7]:
dataset.head(10)

Unnamed: 0,id,label,tweet,clean_text
0,1,0,@user when a father is dysfunctional and is s...,user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thanks for lyft credit i can't us...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide society now motivation
5,6,0,[2/2] huge fan fare and big talking before the...,huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...,user camping tomorrow user user user use...
7,8,0,the next school year is the year for exams.ð...,the next school year is the year for exams ...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,we won love the land allin cavs champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...,user user welcome here i'm it's so gr...


In [8]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace and reduce each token to its Porter stem.
    return [porter.stem(word) for word in text.split()]

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

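tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        tokenizer=tokenizer_porter,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True)

# Learn the vocabulary and idf weights, then build the document-term matrix.
X = tfidf.fit_transform(dataset['clean_text'])
# Convert the sparse matrix to a dense NumPy array.
X = X.toarray()
y = dataset.label.values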
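
To make the `use_idf=True`, `smooth_idf=True` and `norm='l2'` settings more concrete, here is a minimal pure-Python sketch of the same weighting on a toy corpus. The three documents are invented, tokenization is a plain whitespace split (no stemming), and the helper name `tfidf_row` is hypothetical; it is meant to mirror the formula, not to replace `TfidfVectorizer`.

```python
import math
from collections import Counter

docs = [  # toy corpus, not taken from the dataset
    "love the land",
    "love u all the time",
    "society now",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(w for toks in tokenized for w in set(toks))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return one l2-normalized tf-idf row over the shared vocabulary."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(toks) for toks in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

Terms that occur in fewer documents receive a larger idf, and every row ends up with unit length, which is what `norm='l2'` requests.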




In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27,
                                                    test_size=0.2, shuffle=True)
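
The effect of `test_size=0.2` on the 5242 rows can be sketched without scikit-learn. The rounding rule below (test count rounded up, the remainder used for training) reflects my understanding of how scikit-learn handles a fractional `test_size`; the shuffle uses Python's `random` module, so the actual row assignment will differ from what `train_test_split` produces with `random_state=27`.

```python
import math
import random

n_samples = 5242   # rows reported by dataset.shape
test_size = 0.2

# Test count is rounded up; training gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Shuffle the row indices with a fixed seed, then cut once.
# Illustrative only: this permutation is not the one scikit-learn draws.
indices = list(range(n_samples))
random.Random(27).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed keeps the split reproducible across runs, which is the role `random_state` plays in the cell above.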

In [1]:
!pip install pytextrank

Collecting pytextrank
  Obtaining dependency information for pytextrank from https://files.pythonhosted.org/packages/91/2b/bc0526279cd182ba6dc2ba92e58c9b4e1db6947475470c57765899c9c1b4/pytextrank-3.3.0-py3-none-any.whl.metadata
  Downloading pytextrank-3.3.0-py3-none-any.whl.metadata (12 kB)
Collecting GitPython>=3.1 (from pytextrank)
  Obtaining dependency information for GitPython>=3.1 from https://files.pythonhosted.org/packages/1d/9a/4114a9057db2f1462d5c8f8390ab7383925fe1ac012eaa42402ad65c2963/GitPython-3.1.44-py3-none-any.whl.metadata
  Downloading GitPython-3.1.44-py3-none-any.whl.metadata (13 kB)
Collecting graphviz>=0.13 (from pytextrank)
  Obtaining dependency information for graphviz>=0.13 from https://files.pythonhosted.org/packages/00/be/d59db2d1d52697c6adc9eacaf50e8965b6345cc143f671e1ed068818d5cf/graphviz-0.20.3-py3-none-any.whl.metadata
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting icecream>=2.1 (from pytextrank)
  Obtaining dependency informat

In [2]:
import spacy
import pytextrank



In [6]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [7]:
document = """Not only did it only confirm that the film would be unfunny and generic, but it also managed to give away the ENTIRE movie; and I'm not exaggerating - every moment, every 
plot point, every joke is told in the trailer."""

In [8]:
en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank")
doc = en_nlp(document)

In [9]:
tr = doc._.textrank
print(tr.elapsed_time)

11.487483978271484


In [10]:
for combination in doc._.phrases:
    print(combination.text, combination.rank, combination.count)

ENTIRE 0.13514348101679782 1
the ENTIRE movie 0.09548608913294183 1
every 
plot point 0.07067668581298282 1
every joke 0.05936552514177136 1
the film 0.05423292745389326 1
the trailer 0.04834919915077192 1
I 0.0 1
it 0.0 2
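
The ranks printed above come from a graph-based score. As a rough illustration of the idea behind TextRank, the sketch below runs PageRank over a word co-occurrence graph using only the standard library. The token list is a hand-picked set of content words from `document`, and the function name, window size and damping factor are assumptions; pytextrank itself works on spaCy lemmas and scores noun chunks, so the numbers will not match.

```python
from collections import defaultdict

def textrank_words(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each token to the next `window` tokens.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    nodes = list(neighbors)
    rank = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's rank.
        rank = {
            w: (1 - damping) / len(nodes)
            + damping * sum(rank[v] / len(neighbors[v]) for v in neighbors[w])
            for w in nodes
        }
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical pre-filtered tokens (content words only).
tokens = ["film", "unfunny", "generic", "entire", "movie",
          "moment", "plot", "point", "joke", "trailer"]
ranking = textrank_words(tokens)
for word, score in ranking:
    print(f"{word:10s} {score:.4f}")
```

Words in the middle of the sequence have more neighbors and therefore tend to score higher than words at either end.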


In [11]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [13]:
def get_only_text(url):
    """Fetch a page and return (title, tab-joined text of all <p> elements)."""
    page = urlopen(url)
    soup = BeautifulSoup(page)
    text = '\t'.join(p.text for p in soup.find_all('p'))
    print(text)
    return soup.title.text, text
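
The same extraction can be tried offline on a small HTML string. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup, and the class name `ParagraphExtractor` and the sample markup are hypothetical; it only illustrates the "title plus tab-joined paragraphs" shape of the return value.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            # Text inside nested inline tags (e.g. <b>) is kept as well.
            self.paragraphs[-1] += data

html = ("<html><head><title>Demo page</title></head><body>"
        "<p>First <b>bold</b> paragraph.</p><div>skip me</div>"
        "<p>Second one.</p></body></html>")
extractor = ParagraphExtractor()
extractor.feed(html)

# Unpack the two parts instead of keeping them together in one tuple.
title, text = extractor.title, "\t".join(extractor.paragraphs)
print(title)
print(text)
```

Unpacking the result of `get_only_text` the same way (`title, text = get_only_text(url)`) keeps `text` a plain string, so `len(text)` and `text[:1000]` act on the body alone.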

In [14]:
url="https://en.wikipedia.org/wiki/Natural_language_processing"
text = get_only_text(url)

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.
	Major tasks in natural language processing are speech recognition, text classification, natural-language understanding, and natural-language generation.
	Natural language processing has its roots in the 1950s.[1] Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. The pro

In [15]:
# text is the (title, body) tuple returned by get_only_text, so this counts both strings.
len(''.join(text))

6587

In [17]:
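# Slicing the tuple returns both elements; text[1][:1000] would give the first 1000 characters of the body.
text[:1000]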

('Natural language processing - Wikipedia',
 'Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.\n\tMajor tasks in natural language processing are speech recognition, text classification, natural-language understanding, and natural-language generation.\n\tNatural language processing has its roots in the 1950s.[1] Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a prob

In [18]:
!pip install sumy

Collecting sumy
  Obtaining dependency information for sumy from https://files.pythonhosted.org/packages/19/46/77859104e7c3e12dfa2e5c0e27b5dd1e14cb2409b50f9c936a48f29ceaee/sumy-0.11.0-py2.py3-none-any.whl.metadata
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting pycountry>=18.2.23 (from sumy)
  Obtaining dependency information for pycountry>=18.2.23 from https://files.pythonhosted.org/packages/b1/ec/1fb891d8a2660716aadb2143235481d15ed1cbfe3ad669194690b0604492/pycountry-24.6.1-py3-none-any.whl.metadata
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading sumy-0.1

In [19]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

In [20]:
pip install lxml_html_clean

Collecting lxml_html_clean
  Obtaining dependency information for lxml_html_clean from https://files.pythonhosted.org/packages/f7/ba/2af7a60b45bf21375e111c1e2d5d721108d06c80e3d9a3cc1d767afe1731/lxml_html_clean-0.4.1-py3-none-any.whl.metadata
  Downloading lxml_html_clean-0.4.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.1-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.1
Note: you may need to restart the kernel to use updated packages.


In [21]:
LANGUAGE = "english"
SENTENCES_COUNT = 10
url="https://en.wikipedia.org/wiki/Natural_language_processing"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)

[ 2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the worse efficiency if the algorithm used has a low enough time complexity to be practical.
[ 14] This is increasingly important in medicine and healthcare, where NLP helps analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care[ 16] or protect patient privacy.
the larger such a (probabilistic) language model is, the more accurate it becomes, in contrast to rule-based systems that can gain accuracy only by increasing the amount and complexity of the rules leading to intractability problems.
[34][35][36] As far as orthography, morph

In [22]:
text="""A vaccine for the coronavirus will likely be ready by early 2021 but rolling it out safely across India’s 1.3 billion people will be the country’s biggest challenge in fighting its surging epidemic, a leading vaccine scientist told Bloomberg.
India, which is host to some of the front-runner vaccine clinical trials, currently has no local infrastructure in place to go beyond immunizing babies and pregnant women, said Gagandeep Kang, professor of microbiology at the Vellore-based Christian Medical College and a member of the WHO’s Global Advisory Committee on Vaccine Safety.
The timing of the vaccine is a contentious subject around the world. In the U.S., President Donald Trump has contradicted a top administration health expert by saying a vaccine would be available by October. In India, Prime Minister Narendra Modi’s government had promised an indigenous vaccine as early as mid-August, a claim the government and its apex medical research body has since walked back.
"""

In [23]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [24]:
parser = PlaintextParser.from_string(text,Tokenizer("english"))
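
For intuition about what an extractive summarizer does with a parsed document, here is a deliberately simplified frequency-based scorer written with the standard library. It is not sumy's LSA or Luhn implementation; the stop-word list, the sentence splitter, the `summarize` helper and the three-sentence sample are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"a", "the", "is", "was", "in", "of", "and", "to", "by",
              "for", "it", "be", "will", "as", "its", "on", "at"}

def content_words(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]

def summarize(text, sentence_count=1):
    """Return the highest-scoring sentences in their original order."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip())
                 if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average corpus frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:sentence_count]
    return [s for s in sentences if s in top]

sample = ("A vaccine will likely be ready by early 2021. "
          "Rolling out the vaccine safely will be the biggest challenge. "
          "The weather was pleasant.")
summary = summarize(sample, sentence_count=1)
print(summary)
```

Sentences that share frequent content words with the rest of the text score higher, which is the basic intuition that sumy's summarizers refine with stemming, proper stop-word lists and, for LSA, a matrix decomposition.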