- Natural Language Toolkit (NLTK): a general-purpose toolkit covering most classic NLP techniques.
- spaCy: industrial-strength NLP with Python and Cython.
- Gensim: topic modelling for humans.
- Stanford CoreNLP: NLP services and packages by the Stanford NLP Group.

In [1]:
import pandas as pd
import numpy as np
from newspaper import Article
import nltk
import pickle

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity 
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

import matplotlib.pyplot as plt
import string
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
### for summarization
from gensim.summarization.summarizer import summarize as gensim_summarize 

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from langdetect import detect

ModuleNotFoundError: No module named 'newspaper'

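### download only the corpora used below (a bare nltk.download() opens the interactive downloader)
nltk.download('stopwords')
nltk.download('wordnet')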

# Part 1: get raw text data 

In [3]:
rawDf = pd.read_csv("../outputs/step1_bigquery_output.csv")

In [4]:
rawDf.head()

Unnamed: 0,DATE,THEMES,DocumentIdentifier
0,20190101060000,EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTSOFINT...,https://www.daijiworld.com/chan/exclusiveDispl...
1,20190101061500,TAX_FNCACT;TAX_FNCACT_MAN;ARREST;SOC_GENERALCR...,https://caymannewsservice.com/2018/12/
2,20190101063000,TAX_FNCACT;TAX_FNCACT_LEADER;ENV_NUCLEARPOWER;...,https://www.vesti.bg/tehnologii/bil-gejts-sash...
3,20190101061500,ENV_GREEN;WB_507_ENERGY_AND_EXTRACTIVES;WB_525...,https://www.ajc.com/business/economy/georgia-p...
4,20190101061500,ENV_GREEN;WB_507_ENERGY_AND_EXTRACTIVES;WB_525...,https://pv-magazine-usa.com/2018/12/18/breakin...


In [14]:
rawDf.shape

(2988, 3)

In [7]:
url = rawDf.DocumentIdentifier.values[0]
url


'https://www.daijiworld.com/chan/exclusiveDisplay.aspx?articlesID=4959'

In [11]:
from newspaper import Article

def get_text(url):
    """
    Func: download an article and return its raw text
        Input: url, a link to an article
        Output: the article text as a string, or None if downloading or parsing fails
    """
    try:
        article = Article(url)
        article.download()

        ### parse html file
        article.parse()
        return article.text
    except Exception:
        print(f'failed to download news from {url}')
        return None

In [12]:
rawDf.DocumentIdentifier[1]

'https://caymannewsservice.com/2018/12/'

In [13]:
eg = get_text(rawDf.DocumentIdentifier[13])
eg

'ZAP, together with its subsidiaries, designs, develops, manufactures, and sells electric and advanced technology vehicles in the United States and internationally. The company offers electric, alternative energy and fuel efficient automobiles and commercial vehicles, motorcycles and scooters, and other forms of personal transportation. ZAP also markets its electric transportation products through its zapworld.com Website. The company was formerly known as ZAPWORLD.COM and changed its name to ZAP in 2001. ZAP was founded in 1994 and is headquartered in Santa Rosa, California.\n\nMarketBeat Community Rating for ZAP (OTCMKTS ZAAP)\n\nCommunity Ranking: 3.1 out of 5 ( ) Outperform Votes: 80 (Vote Outperform) Underperform Votes: 47 (Vote Underperform) Total Votes: 127\n\nMarketBeat\'s community ratings are surveys of what our community members think about ZAP and other stocks. Vote "Outperform" if you believe ZAAP will outperform the S&P 500 over the long term. Vote "Underperform" if you b

# Part 2: perform nlp analysis

## 2.1 Translate

#### Some of the news is not in English, so we first detect the language of each article. The pipeline in Part 3 keeps English text only rather than translating the rest.

In [8]:
from langdetect import detect

def detect_lang(text):
    ### detect the language code of the text, e.g. "en"
    try:
        language = detect(text)
        print(f"language is {language}")
    except Exception:
        print("Not able to detect language")
        language = "other"
    return language

In [9]:
detect_lang(eg)

language is en


'en'
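
Once a language code is available, dropping non-English articles is a simple filter. The sketch below is one possible way to do it and is not part of the original notebook: `keep_english` and its `detector` parameter are hypothetical names, and the default detector assumes `langdetect` is installed. The demo passes a fake detector so it runs without the library.

```python
def keep_english(texts, detector=None):
    """Return only the texts whose detected language is English."""
    if detector is None:
        # Assumption: langdetect is installed; imported lazily so the demo runs without it.
        from langdetect import detect as detector
    kept = []
    for text in texts:
        try:
            # Skip missing/empty texts, then keep the text only if it is English.
            if text and detector(text) == "en":
                kept.append(text)
        except Exception:
            # Detection can fail on texts with no usable features; skip those.
            continue
    return kept


def fake_detector(text):
    # Hypothetical stand-in for langdetect.detect: ASCII-only text counts as English.
    return "en" if text.isascii() else "bg"


demo_kept = keep_english(["hello world", None, "здравей свят"], detector=fake_detector)
print(demo_kept)
```

Passing the detector as an argument keeps the filter easy to test and makes it simple to swap in another language-identification library later.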

## 2.2 Text summarization
#### The reason for text summarization
- Reduce noise: news pages contain content that is irrelevant to the story, and summarizing keeps only the most central sentences.
- Improve scalability: shorter strings make every downstream step cheaper to run.

### Intro to gensim summarization
#### The gensim implementation is based on the popular TextRank algorithm. An introduction: https://medium.com/@shivangisareen/text-summarisation-with-gensim-textrank-46bbb3401289
#### Note that the `gensim.summarization` module was removed in Gensim 4.0, so the import used here requires an earlier release.
#### There are two kinds of summarization methods:
- Extractive methods: select phrases and sentences from the source document to make up the new summary.
- Abstractive methods: generate entirely new phrases and sentences to capture the meaning of the source document.

##### Here we use an extractive method: TextRank-based text summarization.
(To learn more, search online or contact the DBC consultant about further NLP courses.)
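
To make the extractive idea concrete, here is a minimal standard-library sketch of TextRank-style sentence ranking. It is not gensim's implementation and will not necessarily select the same sentences: the sentence splitter, the word-overlap similarity, and the names `split_sentences`, `similarity` and `textrank_summary` are all simplifying assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Shared-word count normalized by the log of each sentence's vocabulary size.
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return sentences
    # Weighted, undirected sentence graph (no self-loops).
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style power iteration over the sentence graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    # Keep the top-scoring sentences, restored to document order.
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences])
    return [sentences[i] for i in top]


demo_text = (
    "Electric cars use batteries. "
    "Batteries power electric cars and electric scooters. "
    "The weather was sunny yesterday. "
    "Electric scooters also use batteries."
)
demo_summary = textrank_summary(demo_text, n_sentences=2)
print(demo_summary)
```

Sentences that share vocabulary with many other sentences reinforce each other's scores, while an off-topic sentence with no shared words only receives the baseline score and is left out of the summary.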

In [17]:
eg

'ZAP, together with its subsidiaries, designs, develops, manufactures, and sells electric and advanced technology vehicles in the United States and internationally. The company offers electric, alternative energy and fuel efficient automobiles and commercial vehicles, motorcycles and scooters, and other forms of personal transportation. ZAP also markets its electric transportation products through its zapworld.com Website. The company was formerly known as ZAPWORLD.COM and changed its name to ZAP in 2001. ZAP was founded in 1994 and is headquartered in Santa Rosa, California.\n\nMarketBeat Community Rating for ZAP (OTCMKTS ZAAP)\n\nCommunity Ranking: 3.1 out of 5 ( ) Outperform Votes: 80 (Vote Outperform) Underperform Votes: 47 (Vote Underperform) Total Votes: 127\n\nMarketBeat\'s community ratings are surveys of what our community members think about ZAP and other stocks. Vote "Outperform" if you believe ZAAP will outperform the S&P 500 over the long term. Vote "Underperform" if you b

In [18]:
from gensim.summarization.summarizer import summarize as gensim_summarize 

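def summarize(text,**kwargs):
    ### the parameter is named `text` so it does not shadow the imported `string` module
    ### fall back to the original text if gensim cannot summarize it (e.g. the input is too short)
    try:
        summarized = gensim_summarize(text,**kwargs)
    except Exception:
        return text
    return summarized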

In [19]:
egSummary = summarize(eg)
egSummary

'Vote "Outperform" if you believe ZAAP will outperform the S&P 500 over the long term.\nVote "Underperform" if you believe ZAAP will underperform the S&P 500 over the long term.'

## 2.3 Preprocess text

- Remove digits and punctuation: based on the standard-library `string` module.
- Remove stop words: based on the NLTK English stop-word list.
##### Certain parts of English speech, like conjunctions (“for”, “or”) or the word “the”, are meaningless to a topic model. These terms are called stop words and need to be removed from our token list.
- Lemmatize: use the nltk WordNetLemmatizer.
##### Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. For example, "forms" becomes "form". The lemmatizer is not perfect: it also turns "as" into "a".
- Stem: use the nltk PorterStemmer (SnowballStemmer is left commented out in the code as an alternative).
##### Stemming is another common NLP technique that reduces topically similar words to their root. For example, “stemming” and “stemmed” have similar meanings, and stemming reduces both to “stem”. This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model.

In [35]:
text = eg
### Remove digits
text = text.translate(str.maketrans('', '', string.digits))
### Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

In [36]:
text

'ZAP together with its subsidiaries designs develops manufactures and sells electric and advanced technology vehicles in the United States and internationally The company offers electric alternative energy and fuel efficient automobiles and commercial vehicles motorcycles and scooters and other forms of personal transportation ZAP also markets its electric transportation products through its zapworldcom Website The company was formerly known as ZAPWORLDCOM and changed its name to ZAP in  ZAP was founded in  and is headquartered in Santa Rosa California\n\nMarketBeat Community Rating for ZAP OTCMKTS ZAAP\n\nCommunity Ranking  out of    Outperform Votes  Vote Outperform Underperform Votes  Vote Underperform Total Votes \n\nMarketBeats community ratings are surveys of what our community members think about ZAP and other stocks Vote Outperform if you believe ZAAP will outperform the SP  over the long term Vote Underperform if you believe ZAAP will underperform the SP  over the long term Yo

In [46]:
def pre_process(text,return_str=False):
    ### Remove digits: str.maketrans(fromstr, tostr, deletestr) builds a table that deletes every character in deletestr
    text = text.translate(str.maketrans('', '', string.digits))
    ### Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    ### Remove stop words and URLs (build the stop-word set once instead of once per word)
    stop_words = set(stopwords.words('english'))
    text = [word for word in text.split() if word.lower() not in stop_words and not word.startswith("http")]

    ### Lemmatize
    wnl = nltk.WordNetLemmatizer()
    text = [wnl.lemmatize(word) for word in text]
    ### Stem (SnowballStemmer("english") is an alternative)
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in text]

    if return_str:
        return ' '.join(words)
    else:
        return words

In [47]:
processedText = pre_process(egSummary,return_str=True)

In [49]:
processedText = pre_process(egSummary,return_str=False)

In [50]:
processedText

['vote',
 'outperform',
 'believ',
 'zaap',
 'outperform',
 'SP',
 'long',
 'term',
 'vote',
 'underperform',
 'believ',
 'zaap',
 'underperform',
 'SP',
 'long',
 'term']

# Part 3: Modularize  

In [81]:
mypipe = ProcessPipeline(texts=[eg]*10)

In [92]:
class ProcessPipeline:
    def __init__(self, texts=None, steps=("langdetection", "summarization", "tokenization")):
        ### steps is a tuple to avoid a mutable default argument
        self.stemmer = PorterStemmer()
        self.lemmatizer = nltk.WordNetLemmatizer()
        ### build the stop-word set once instead of once per word
        self.stop_words = set(stopwords.words('english'))
        self.texts = texts
        self.steps = steps

    def process(self, text, return_str=False):
        if "langdetection" in self.steps:
            lang = self.detect_lang(text)
            ### keep English text only; anything else becomes an empty string
            if lang != "en":
                text = ""
        if "summarization" in self.steps:
            text = self.summarize(text)
        if "tokenization" in self.steps:
            return self.pre_process(text, return_str=return_str)
        return text

    def run(self, return_str=False, workers=6):
        ### process all texts in parallel, one worker process per chunk of work
        with ProcessPoolExecutor(max_workers=workers) as executor:
            res = executor.map(self.process, self.texts, [return_str] * len(self.texts))
            return list(res)

    def run_lambda(self):
        return list(map(self.process, self.texts))

    def run_loop(self):
        processed = []
        for text in self.texts:
            processed.append(self.process(text))
        return processed

    def detect_lang(self, text):
        ### detect the language code of the text, e.g. "en"
        try:
            language = detect(text)
            print(f"language is {language}")
        except Exception:
            print("Not able to detect language")
            language = "other"
        return language

    def summarize(self, text, **kwargs):
        ### fall back to the original text if gensim cannot summarize it
        try:
            return gensim_summarize(text, **kwargs)
        except Exception:
            return text

    def pre_process(self, text, return_str=False):
        ### Remove digits: str.maketrans(fromstr, tostr, deletestr) deletes every character in deletestr
        text = text.translate(str.maketrans('', '', string.digits))
        ### Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        ### Remove stop words and URLs
        text = [word for word in text.split() if word.lower() not in self.stop_words and not word.startswith("http")]

        ### Lemmatize
        text = [self.lemmatizer.lemmatize(word) for word in text]
        ### Stem (SnowballStemmer("english") is an alternative)
        words = [self.stemmer.stem(word) for word in text]
        if return_str:
            return ' '.join(words)
        else:
            return words

In [93]:
pipeline = ProcessPipeline()

In [94]:
egText = eg
lang = pipeline.detect_lang(egText)
lang

language is en


'en'

In [95]:
summarized = pipeline.summarize(egText)

summarized

'Vote "Outperform" if you believe ZAAP will outperform the S&P 500 over the long term.\nVote "Underperform" if you believe ZAAP will underperform the S&P 500 over the long term.'

In [96]:
processed = pipeline.pre_process(summarized,return_str=True)
processed

'vote outperform believ zaap outperform SP long term vote underperform believ zaap underperform SP long term'

# Part 4: Scrape raw news data

In [40]:
rawDf = pd.read_csv("outputs/step1_bigquery_output.csv")

In [None]:
texts = list(map(lambda x:get_text(x),rawDf.DocumentIdentifier))
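
Downloading a few thousand URLs one at a time is slow, because each call spends most of its time waiting on the network. The sketch below shows one way to overlap those waits with the `ThreadPoolExecutor` imported at the top of the notebook. It is an illustration under stated assumptions, not the notebook's method: `fetch_all` is a hypothetical helper, and `fake_fetch` is a stand-in for `get_text` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=8):
    """Apply fetch to every URL on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as the input URLs.
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in for get_text: returns None for a "failed" download.
    return None if url.endswith("/bad") else f"text of {url}"


demo_urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/b"]
demo_texts = fetch_all(demo_urls, fake_fetch)
print(demo_texts)
```

In the notebook this would be called as `fetch_all(rawDf.DocumentIdentifier, get_text)`. Because `get_text` already returns `None` on failure, one bad URL does not abort the whole batch.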

In [None]:
### save as pickle
with open('outputs/step2_news_raw.pickle', 'wb') as handle:
    pickle.dump(texts, handle, protocol=pickle.HIGHEST_PROTOCOL)
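
A later step can read the saved list back with `pickle.load`. The round trip below is a small sketch that writes to a temporary directory instead of `outputs/`; the helper names `save_texts` and `load_texts` are hypothetical and only used for illustration.

```python
import os
import pickle
import tempfile


def save_texts(texts, path):
    # Same call as in the cell above: binary mode, highest pickle protocol.
    with open(path, "wb") as handle:
        pickle.dump(texts, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_texts(path):
    # The file must be opened in binary mode for reading as well.
    with open(path, "rb") as handle:
        return pickle.load(handle)


sample = ["first article", None, "third article"]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "step2_news_raw.pickle")
    save_texts(sample, sample_path)
    restored = load_texts(sample_path)
print(restored)
```

Failed downloads are stored as `None`, so the restored list keeps the same length and order as the URL column and can be joined back to `rawDf` by position.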