# Sentiment Analysis on Data Science Articles

<img src="sentiment.png"
     alt="Sentiment analysis"
     style="float: left; margin-right: 10px;" />

Sentiment analysis, also referred to as opinion mining, is an approach to natural language processing (NLP) that identifies the emotional tone behind a body of text. This is a popular way for organizations to determine and categorize opinions about a product, service, or idea.

The data source for these articles is the **insights.blackcoffer.com** site.

#### Timeline for the Project:
- Import Libraries and Data Set
- Perform Text Scraping
- Data Preprocessing
- Perform Sentiment Analysis
- Conclusion

## Importing libraries and data set

In [1]:
import pandas as pd # data analysis
import numpy as np # computation
from selenium import webdriver # read data from url (automation)
from selenium.webdriver.common.by import By # locator strategies (CSS selector, id, ...) for finding elements
from selenium.webdriver.support import expected_conditions as EC # conditions used with WebDriverWait
from selenium.webdriver.support.ui import WebDriverWait # waits for a condition to be fulfilled

#### Read link file

In [2]:
links = pd.read_excel('Input.xlsx')
links.head()

Unnamed: 0,URL_ID,URL
0,1,https://insights.blackcoffer.com/how-is-login-...
1,2,https://insights.blackcoffer.com/how-does-ai-h...
2,3,https://insights.blackcoffer.com/ai-and-its-im...
3,4,https://insights.blackcoffer.com/how-do-deep-l...
4,5,https://insights.blackcoffer.com/how-artificia...


In [3]:
driver = webdriver.Safari() # start a Safari browser session controlled by Selenium

#### Making functions to scrape data from the links

In [4]:
## function to scrape the article text from a link
def scrape_data(link):
    driver.get(link)

    # wait up to 15 seconds for the article body to be present in the DOM
    content = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.td-post-content'))
    )

    return content.text


In [5]:
## function to save the scraped files
def save_file(scrapdata):
    for data in scrapdata:
        name = str(data['URL_ID']) + ".txt"

        # the context manager closes the file even if the write fails
        with open("./Articles/" + name, 'w', encoding='utf-8') as f:
            f.write(data['TEXT'])
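
Writing into `./Articles/` fails if that folder does not exist yet. The following is a minimal sketch of one way to guard against that, assuming records shaped like the `URL_ID`/`TEXT` dictionaries built in this notebook; the helper name `save_articles` and the sample records are hypothetical.

```python
from pathlib import Path
import tempfile

def save_articles(records, out_dir):
    """Write each record's TEXT to <out_dir>/<URL_ID>.txt and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    written = []
    for record in records:
        path = out_dir / f"{record['URL_ID']}.txt"
        path.write_text(record['TEXT'], encoding='utf-8')
        written.append(path)
    return written

# Usage with made-up records and a temporary folder
sample = [{'URL_ID': 1, 'TEXT': 'first article'},
          {'URL_ID': 2, 'TEXT': 'second article'}]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_articles(sample, Path(tmp) / 'Articles')
    saved_names = sorted(p.name for p in paths)
    first_text = paths[0].read_text(encoding='utf-8')
```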

## Perform scraping 

In [None]:
data = []

for index, row in links.iterrows():
    item={}
    item['URL_ID']=row['URL_ID']
    try:
        item['TEXT']=scrape_data(row['URL'])
    except Exception:
        item['TEXT']='Data not found' # placeholder text, filtered out during preprocessing
    data.append(item)
save_file(data)

#### Making a data frame of scraped data

In [None]:
from os import listdir
import os
path = 'Articles'
files = listdir(path)

In [None]:
rows = []

for file in files:
    # read each saved article; the context manager closes the file
    with open("./Articles/"+file,"r",encoding='utf-8') as f:
        content = f.read()
    # file names look like "12.txt" or "12.0.txt"; keep the leading integer
    number = int(file.split(".")[0])
    rows.append({"filenumber":number,"text":content})

# build the data frame once instead of concatenating inside the loop
df = pd.DataFrame(rows, columns=["filenumber","text"])


In [None]:
df.head()

In [None]:
df = df.sort_values(by="filenumber")
df.to_csv("content.csv",index=None)

## Data preprocessing

In [None]:
df =pd.read_csv('content.csv')
df.head()

#### Remove corrupted rows: rows containing CSS or 'Data not found'

In [None]:
df['text'] = df['text'].astype(str)

In [None]:
string_to_remove1 = 'custom css'
string_to_remove2 = 'Data not found'

# Dropping the rows that contain string_to_remove1 and string_to_remove2
df = df[~df['text'].str.contains(string_to_remove1)]
df = df[~df['text'].str.contains(string_to_remove2)]

df.head()

##### Now that the corrupted rows are removed, start data preprocessing

In [None]:
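df['URL']=links['URL'] # aligned on the row index, so row i here must match row i of Input.xlsx
df["Number of sentences"]= df['text'].apply(lambda x: len(x.split('.'))) # rough count: pieces separated by '.'
df.columns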

In [None]:
def short_forms():    
    return {
        "cant":"can not",
        "dont":"do not",
        "wont":"will not",
        "ain't":"is not",
        "amn't":"am not",
        "aren't":"are not",
        "can't":"cannot",
        "'cause":"because",
        "couldn't":"could not",
        "couldn't've":"could not have",
        "could've":"could have",
        "daren't":"dare not",
        "daresn't":"dare not",
        "dasn't":"dare not",
        "didn't":"did not",
        "doesn't":"does not",
        "don't":"do not",
        "e'er":"ever",
        "em":"them",
        "everyone's":"everyone is",
        "finna":"fixing to",
        "gimme":"give me",
        "gonna":"going to",
        "gon't":"go not",
        "gotta":"got to",
        "hadn't":"had not",
        "hasn't":"has not",
        "haven't":"have not",
        "he'd":"he would",
        "he'll":"he will",
        "he's":"he is",
        "he've":"he have",
        "how'd":"how would",
        "how'll":"how will",
        "how're":"how are",
        "how's":"how is",
        "I'd":"I would",
        "I'll":"I will",
        "I'm":"I am",
        "I'm'a":"I am about to",
        "I'm'o":"I am going to",
        "isn't":"is not",
        "it'd":"it would",
        "it'll":"it will",
        "it's":"it is",
        "I've":"I have",
        "kinda":"kind of",
        "let's":"let us",
        "mayn't":"may not",
        "may've":"may have",
        "mightn't":"might not",
        "might've":"might have",
        "mustn't":"must not",
        "mustn't've":"must not have",
        "must've":"must have",
        "needn't":"need not",
        "ne'er":"never",
        "o'":"of",
        "o'er":"over",
        "ol'":"old",
        "oughtn't":"ought not",
        "shalln't":"shall not",
        "shan't":"shall not",
        "she'd":"she would",
        "she'll":"she will",
        "she's":"she is",
        "shouldn't":"should not",
        "shouldn't've":"should not have",
        "should've":"should have",
        "somebody's":"somebody is",
        "someone's":"someone is",
        "something's":"something is",
        "that'd":"that would",
        "that'll":"that will",
        "that're":"that are",
        "that's":"that is",
        "there'd":"there would",
        "there'll":"there will",
        "there're":"there are",
        "there's":"there is",
        "these're":"these are",
        "they'd":"they would",
        "they'll":"they will",
        "they're":"they are",
        "they've":"they have",
        "this's":"this is",
        "those're":"those are",
        "'tis":"it is",
        "'twas":"it was",
        "wanna":"want to",
        "wasn't":"was not",
        "we'd":"we would",
        "we'd've":"we would have",
        "we'll":"we will",
        "we're":"we are",
        "weren't":"were not",
        "we've":"we have",
        "what'd":"what did",
        "what'll":"what will",
        "what're":"what are",
        "what's":"what is",
        "what've":"what have",
        "when's":"when is",
        "where'd":"where did",
        "where're":"where are",
        "where's":"where is",
        "where've":"where have",
        "which's":"which is",
        "who'd":"who would",
        "who'd've":"who would have",
        "who'll":"who will",
        "who're":"who are",
        "who's":"who is",
        "who've":"who have",
        "why'd":"why did",
        "why're":"why are",
        "why's":"why is",
        "won't":"will not",
        "wouldn't":"would not",
        "would've":"would have",
        "y'all":"you all",
        "you'd":"you would",
        "you'll":"you will",
        "you're":"you are",
        "you've":"you have",
        "Whatcha":"What are you",
        "luv":"love",
        "sux":"sucks",
        "couldn't":"could not",
        "wouldn't":"would not",
        "shouldn't":"should not",
        "im":"i am"
        }

Now we'll clean the text: remove links, digits and punctuation, and expand the short forms.

In [None]:
import re      # regular expressions, used to strip URLs and hashtag signs
import string  # string.punctuation lists the ASCII punctuation characters

## function to normalize the text and replace the short forms
def normalization(data):
    data = str(data).lower()
    # Take out all URLs
    data = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))',' ',data)
    # Drop the '#' of hashtags but keep the word
    data = re.sub(r'#([^\s]+)', r'\1', data)

    # Numbers
    data = ''.join([i for i in data if not i.isdigit()])

    # Short forms: expand them while the apostrophes are still present
    data = data.replace("’","'")
    short_form = {key.lower(): value for key, value in short_forms().items()}
    words = data.split()
    converted = [short_form[word] if word in short_form else word for word in words]
    data = " ".join(converted)

    # Punctuation
    for sym in string.punctuation:
        data = data.replace(sym, " ")
    return " ".join(data.split())
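
One ordering detail matters in a cleaning step like this: a short form that contains an apostrophe can only be matched while the apostrophe is still in the text. The sketch below illustrates the difference with a hypothetical three-entry table (`SHORT_FORMS`) instead of the full `short_forms()` dictionary; the function names and the sample sentence are made up for illustration.

```python
import string

# Hypothetical mini table; the notebook's short_forms() holds the full list
SHORT_FORMS = {"don't": "do not", "it's": "it is", "gonna": "going to"}

def expand_then_strip(text):
    """Expand short forms first, then replace punctuation with spaces."""
    text = text.lower().replace("’", "'")  # unify curly and straight apostrophes
    words = [SHORT_FORMS.get(word, word) for word in text.split()]
    text = " ".join(words)
    for sym in string.punctuation:
        text = text.replace(sym, " ")
    return " ".join(text.split())  # collapse repeated spaces

def strip_then_expand(text):
    """Same steps in the opposite order, for comparison."""
    text = text.lower().replace("’", "'")
    for sym in string.punctuation:
        text = text.replace(sym, " ")
    words = [SHORT_FORMS.get(word, word) for word in text.split()]
    return " ".join(words)

sample = "It's clear we don't need this."
expanded = expand_then_strip(sample)    # "it is clear we do not need this"
unexpanded = strip_then_expand(sample)  # "it s clear we don t need this"
```

When punctuation is stripped first, `don't` is split into `don` and `t` before the lookup runs, so the table entry never matches.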

In [None]:
df

In [None]:
df['text']=df['text'].apply(normalization)

In [None]:
df['text'] = df['text'].apply(lambda x: x.lower())

In [None]:
df.head()

## Sentiment analysis

First we load the Loughran-McDonald master dictionary, a word list with sentiment labels that serves as the reference for the words in our data set.

In [None]:
guide = pd.read_csv('LoughranMcDonald_MasterDictionary_2020.csv')
guide.head()

Next we build lists of positive, negative and uncertain words from the dictionary; these lists are used to score the words of each article.

In [None]:
pos = [] 
neg =[]
Uncertain = []
for index,row in guide.iterrows():
    if row['Negative']>0:
        neg.append(row['Word'].lower())
    elif row['Positive']>0:
        pos.append(row['Word'].lower())
    elif row['Uncertainty']>0:
        Uncertain.append(row['Word'].lower())
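
The same grouping can be sketched with the standard library, collecting the words into sets so that later membership checks are fast. This is only an illustration: the rows in `SAMPLE_CSV` are made up (including the flag values), and only the column names `Word`, `Negative`, `Positive` and `Uncertainty` are taken from the notebook.

```python
import csv
import io

# Made-up rows in the column layout the notebook reads from the dictionary file
SAMPLE_CSV = """Word,Negative,Positive,Uncertainty
ABANDON,2009,0,0
ACHIEVE,0,2009,0
APPROXIMATE,0,0,2009
TABLE,0,0,0
"""

def load_word_sets(handle):
    """Return (positive, negative, uncertain) sets of lowercase words."""
    positive, negative, uncertain = set(), set(), set()
    for row in csv.DictReader(handle):
        word = row['Word'].lower()
        # same precedence as the notebook: negative, then positive, then uncertain
        if int(row['Negative']) > 0:
            negative.add(word)
        elif int(row['Positive']) > 0:
            positive.add(word)
        elif int(row['Uncertainty']) > 0:
            uncertain.add(word)
    return positive, negative, uncertain

pos_set, neg_set, uncertain_set = load_word_sets(io.StringIO(SAMPLE_CSV))
```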

In [None]:
df.head()

In [None]:
def positivescore(text):
    # count the words of the text that appear in the positive word list
    score = 0
    words = text.split()
    for word in words:
        if word in pos:
            score +=1
    return score

def negativescore(text):
    # count the words of the text that appear in the negative word list
    score = 0
    words = text.split()
    for word in words:
        if word in neg:
            score +=1
    return score
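
Both scores are plain counts of matching words, so they can share one helper. A minimal sketch, assuming whitespace tokenization as in the notebook; `count_matches`, the word lists and the sample text are hypothetical.

```python
def count_matches(text, vocabulary):
    """Count how many whitespace-separated tokens of text are in vocabulary."""
    vocabulary = set(vocabulary)  # set membership checks are O(1) on average
    return sum(1 for word in text.split() if word in vocabulary)

# Hypothetical word lists and text, for illustration only
positive_words = ["good", "gain", "improve"]
negative_words = ["loss", "risk"]
text = "good models improve results but carry risk and more risk"

positive_score = count_matches(text, positive_words)  # "good", "improve"
negative_score = count_matches(text, negative_words)  # "risk" twice
```

Repeated words are counted every time they occur, which matches the loop-based functions.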

In [None]:
df['Positive Score']=df['text'].apply(positivescore)
df['Negative Score']=df['text'].apply(negativescore)

In [None]:
df.head()

Next we derive the remaining measures.

- **Subjectivity score** indicates whether a text is written from a factual or an opinionated standpoint: 0 means fully factual and 1 means highly subjective (opinionated).
- **Polarity** represents the sentiment intensity: it measures the degree of positivity, neutrality and negativity. The polarity score ranges from -1 to 1, with -1 indicating strong negative sentiment, 0 indicating neutral sentiment, and 1 indicating strong positive sentiment.
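
As a worked example, the two scores can be computed for a single article from its positive count, negative count and word count. The sketch below uses the formulas applied in this notebook, including the small constant that prevents division by zero; the function names and the counts are hypothetical.

```python
EPSILON = 0.000001  # small constant that avoids division by zero

def polarity_score(positive, negative):
    """(positive - negative) / (positive + negative), between -1 and 1."""
    return (positive - negative) / ((positive + negative) + EPSILON)

def subjectivity_score(positive, negative, word_count):
    """Share of sentiment-bearing words in the text, between 0 and 1."""
    return (positive + negative) / (word_count + EPSILON)

# Hypothetical counts for one article
polarity = polarity_score(positive=30, negative=10)                          # close to 0.5
subjectivity = subjectivity_score(positive=30, negative=10, word_count=400)  # close to 0.1
```

An article with no sentiment words at all gets a polarity of 0, because the numerator is zero while the denominator stays positive.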

In [None]:
df['WORD COUNT']=df['text'].apply(lambda x:len(x.split()))
df['POLARITY SCORE']=(df['Positive Score']-df['Negative Score'])/ ((df['Positive Score'] + df['Negative Score']) + 0.000001)
df['SUBJECTIVITY SCORE']=(df['Positive Score'] + df['Negative Score'])/ ((df['WORD COUNT']) + 0.000001)
df['AVG SENTENCE LENGTH']=df['WORD COUNT']/df['Number of sentences'] # words per sentence
df['AVG NUMBER OF WORDS PER SENTENCE'] = df['WORD COUNT']/df['Number of sentences'] # same ratio under a second column name

In [None]:
df.head()

In [None]:
## for avg length of words
def avgwordlength(text):
    words = text.split()
    no_of_words=len(words)
    if no_of_words == 0:
        return 0 # empty text: avoid dividing by zero
    total_char=sum(len(word) for word in words)
    return total_char/no_of_words

In [None]:
## count the personal pronouns in the text
def pronoun(text):
    # \b marks word boundaries, so "us" is not counted inside "business"
    # note: the text is already lowercased here, so the country name "US" is counted as well
    pronouns = r"\b(i|me|we|my|ours|us)\b"
    return len(re.findall(pronouns, text, re.IGNORECASE))
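
Once the text is lowercased, the pronoun "us" and the country abbreviation "US" can no longer be told apart. A possible variant, sketched here under the assumption that it is applied to the raw text before normalization, skips the all-caps form; `count_personal_pronouns` and the sample sentence are hypothetical.

```python
import re

PRONOUN_PATTERN = re.compile(r"\b(i|me|we|my|ours|us)\b", re.IGNORECASE)

def count_personal_pronouns(raw_text):
    """Count personal pronouns, skipping the all-caps abbreviation 'US'."""
    return sum(
        1 for match in PRONOUN_PATTERN.finditer(raw_text)
        if match.group(0) != "US"
    )

sample = "We moved our US business online, and it helped us and my team."
pronoun_count = count_personal_pronouns(sample)  # counts "We", "us" and "my"
```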

In [None]:
df['AVG WORD LENGTH']=df['text'].apply(avgwordlength)
df['PERSONAL PRONOUNS']=df['text'].apply(pronoun)
df.columns

In [None]:
df.head()

## Conclusion

Looking at the data, we find that every article has a factual tone, so we leave the subjectivity score out of the final judgement and rely on the polarity score instead.

In [None]:
df['POLARITY SCORE'].median()

The median polarity score across all articles is **-0.17647058304498286**. On a scale from -1 to 1 this is only slightly below zero, so we infer that these data science articles have a **pretty neutral** sentiment, leaning mildly negative.