## **TEXT ANALYSIS OF WEBSITE DATA**

The output contains the following columns for each article:

- URL_ID
- URL
- POSITIVE SCORE
- NEGATIVE SCORE
- POLARITY SCORE
- SUBJECTIVITY SCORE
- AVG SENTENCE LENGTH
- PERCENTAGE OF COMPLEX WORDS
- FOG INDEX
- AVG NUMBER OF WORDS PER SENTENCE
- COMPLEX WORD COUNT
- WORD COUNT
- SYLLABLE PER WORD
- PERSONAL PRONOUNS
- AVG WORD LENGTH

In [None]:
from google.colab import files

import xlrd

from bs4 import BeautifulSoup
import json
import numpy as np
import pandas as pd

import requests
from requests.models import MissingSchema
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import re

import csv

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
workbook = xlrd.open_workbook('Input.xlsx')

In [None]:
worksheet = workbook.sheet_by_index(0)

In [None]:
urls = [[worksheet.cell_value(i,0), worksheet.cell_value(i,1)] 
        for i in range(1,worksheet.nrows)]

In [None]:
urls[2]

[3.0,
 'https://insights.blackcoffer.com/ai-and-its-impact-on-the-fashion-industry/']

In [None]:
def extract_text_from_web(url):
    # url is a [URL_ID, URL] pair read from Input.xlsx
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"}
    resp = requests.get(url[1], headers=headers)
    soup = BeautifulSoup(resp.content, 'html.parser')
    # Some pages have no <title>; fall back to an empty string
    title = soup.title.get_text() if soup.title else ''
    # get_text() drops the <p> tags and any markup nested inside them
    text = ' '.join(data.get_text() for data in soup.find_all('p'))

    article = title + " " + text

    # Save the article text to a file named after the URL_ID
    with open(f"{url[0]}.txt", "w", encoding="utf-8") as f:
        f.write(article)
    return article

In [None]:
texts = [extract_text_from_web(url) for url in urls]

## Cleaning using Stop Words Lists

The stop word lists (found in the folder StopWords) are used to clean the text: words that appear in these lists are excluded before sentiment analysis is performed.
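As a minimal sketch of the cleaning step described above, the snippet below filters a sentence against a tiny, hypothetical stop word set. The names `sample_stop_words` and `drop_stop_words` and the regex tokenizer are assumptions made for illustration; the notebook itself loads its lists from the StopWords files and tokenizes with nltk.

```python
import re

# Hypothetical stop words standing in for the contents of the StopWords files.
sample_stop_words = {"the", "is", "a", "of"}

def drop_stop_words(text, stop_words):
    """Return the words of `text` that are not stop words (case-insensitive)."""
    # Simple regex tokenizer used only to keep this sketch dependency-free.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [tok for tok in tokens if tok.lower() not in stop_words]

cleaned = drop_stop_words("The result is a clear sign of progress", sample_stop_words)
print(cleaned)  # ['result', 'clear', 'sign', 'progress']
```

Keeping the stop words in a set rather than a list makes each membership test constant time, which matters once several long lists are merged.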


In [None]:
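def word_list_maker(filename):
    # Read the file; undecodable bytes are replaced rather than raising an error
    with open(filename, mode='r', errors='replace') as file:
        lines = file.readlines()
    # Keep the first whitespace-separated token of each line.
    # Blank lines are skipped; they would otherwise raise an IndexError.
    return [line.split()[0] for line in lines if line.strip()]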

In [None]:
stop_word_list = []
stop_word_filenames = ['StopWords_Auditor.txt',
                       'StopWords_Currencies.txt',
                       'StopWords_DatesandNumbers.txt',
                       'StopWords_Generic.txt',
                       'StopWords_GenericLong.txt',
                       'StopWords_Geographic.txt',
                       'StopWords_Names.txt',
                       ]

for filename in stop_word_filenames:
    stop_word_list += word_list_maker(filename)

In [None]:
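def remove_stop_words(text):
    # Compare whole tokens (not single characters) against the stop words,
    # ignoring case, and return the filtered text as a string.
    stop_words = {word.lower() for word in stop_word_list}
    kept = [word for word in word_tokenize(text) if word.lower() not in stop_words]
    return " ".join(kept)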

In [None]:
stop_word_removed_text = [remove_stop_words(text) for text in texts]

In [None]:
stop_word_removed_text[0]

'How is Login Logout Time Tracking for Employees in Office done by AI? - Blackcoffer Insights When people hear AI they often think about sentient robots and magic boxes. AI today is much more mundane and simple—but that doesn’t mean it’s not powerful. Another misconception is that high-profile research projects can be applied directly to any business situation. AI done right can create an extreme return on investments (ROIs)—for instance through automation or precise prediction. But it does take thought, time, and proper implementation. We have seen that success and value generated by AI projects are increased when there is a grounded understanding and expectation of what the technology can deliver from the C-suite down. “Artificial Intelligence (AI) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason and take action.”3 Lately there has been a 

## Creating a dictionary of Positive and Negative words

The Master Dictionary (found in the folder MasterDictionary) is used to build a dictionary of positive and negative words. A word is added to the dictionary only if it does not appear in the stop word lists.


In [None]:
positive_words = word_list_maker('positive-words.txt')
negative_words = word_list_maker('negative-words.txt')

In [None]:
pn_dict = {}
pn_dict['positive'] = [word for word in positive_words if word not in stop_word_list] 
pn_dict['negative'] = [word for word in negative_words if word not in stop_word_list]

In [None]:
pn_dict

{'negative': ['2-faced',
  '2-faces',
  'abnormal',
  'abolish',
  'abominable',
  'abominably',
  'abominate',
  'abomination',
  'abort',
  'aborted',
  'aborts',
  'abrade',
  'abrasive',
  'abrupt',
  'abruptly',
  'abscond',
  'absence',
  'absent-minded',
  'absentee',
  'absurd',
  'absurdity',
  'absurdly',
  'absurdness',
  'abuse',
  'abused',
  'abuses',
  'abusive',
  'abysmal',
  'abysmally',
  'abyss',
  'accidental',
  'accost',
  'accursed',
  'accusation',
  'accusations',
  'accuse',
  'accuses',
  'accusing',
  'accusingly',
  'acerbate',
  'acerbic',
  'acerbically',
  'ache',
  'ached',
  'aches',
  'achey',
  'aching',
  'acrid',
  'acridly',
  'acridness',
  'acrimonious',
  'acrimoniously',
  'acrimony',
  'adamant',
  'adamantly',
  'addict',
  'addicted',
  'addicting',
  'addicts',
  'admonish',
  'admonisher',
  'admonishingly',
  'admonishment',
  'admonition',
  'adulterate',
  'adulterated',
  'adulteration',
  'adulterier',
  'adversarial',
  'adversary'

## Extracting Derived variables

We convert the text into a list of tokens using the nltk tokenize module and use these tokens to calculate the 4 variables described below:


### Positive Score: 
This score is calculated by assigning a value of +1 to each word found in the Positive Dictionary and then adding up all the values.

### Negative Score: 
This score is calculated by assigning a value of -1 to each word found in the Negative Dictionary and then adding up all the values. We multiply the total by -1 so that the score is a positive number.

### Polarity Score: 
This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula: 

Polarity Score = (Positive Score – Negative Score)/ ((Positive Score + Negative Score) + 0.000001)

Range is from -1 to +1

### Subjectivity Score: 
This is the score that determines if a given text is objective or subjective. It is calculated by using the formula: 

Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

Range is from 0 to +1
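The two formulas above can be checked by hand on a small input. The following sketch uses hypothetical toy dictionaries and tokens (`toy_positive`, `toy_negative`, `toy_tokens`) rather than the Master Dictionary, and the helper name `sentiment_scores` is an assumption for illustration only.

```python
def sentiment_scores(tokens, positive, negative):
    """Return (positive score, negative score, polarity, subjectivity)."""
    pos = sum(1 for tok in tokens if tok in positive)
    neg = sum(1 for tok in tokens if tok in negative)
    # The small constant avoids division by zero when no words match.
    polarity = (pos - neg) / ((pos + neg) + 0.000001)
    subjectivity = (pos + neg) / (len(tokens) + 0.000001)
    return pos, neg, polarity, subjectivity

# Hypothetical toy data, not taken from the Master Dictionary.
toy_positive = {"good", "great"}
toy_negative = {"bad"}
toy_tokens = ["good", "service", "great", "price", "bad", "parking"]

pos, neg, polarity, subjectivity = sentiment_scores(toy_tokens, toy_positive, toy_negative)
print(pos, neg, round(polarity, 3), round(subjectivity, 3))  # 2 1 0.333 0.5
```

With two positive words and one negative word among six tokens, the polarity is roughly (2 - 1) / 3 and the subjectivity roughly 3 / 6.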


In [None]:
def get_key(val):
    for key, value in pn_dict.items():
        if val in value:
            return key

def syllable_count(word):
    count = 0
    vowels = 'aeiouy'
    word = word.lower()
    if word[0] in vowels:
        count +=1
    for index in range(1,len(word)):
        if word[index] in vowels and word[index-1] not in vowels:
            count +=1
    if word.endswith('e'):
        count -= 1
    if word.endswith('es'):
        count -= 1
    if word.endswith('ed'):
        count -= 1
    if word.endswith('le'):
        count += 1
    if count == 0:
        count += 1
    return count

In [None]:
positive_score = []
negative_score = []
num_of_words = []
num_of_syllable = []
num_of_complex_word = []
word_length = []

ignore = ['!','@','#','$','%','^','&','*','(',')','>','<']

for text in texts:
    p, n = 0, 0
    temp1 = []
    count = 0
    temp = word_tokenize(text)

    for items in temp:
        if items not in ignore:
           temp1.append(items)
    num_of_words.append(len(temp1))
    
    l = []
    length = []
    for word in temp1:
        if get_key(word) == 'positive':
            p += 1
        elif get_key(word) == 'negative':
            n -= 1

        l.append(syllable_count(word))
        length.append(len(word))

    num_of_syllable.append(l)
    positive_score.append(p)
    negative_score.append(-1*n)
    word_length.append(length)


for words in num_of_syllable:
    count = 0
    for length in words:
        if length > 2:
            count += 1
    num_of_complex_word.append(count)

num_of_words = np.array(num_of_words)
negative_score = np.array(negative_score)
positive_score = np.array(positive_score)
num_of_complex_word = np.array(num_of_complex_word)
# Each article has a different number of words, so store the ragged lists as objects
word_length = np.array(word_length, dtype=object)

In [None]:
polarity_score = ((positive_score - negative_score)/(positive_score + negative_score + 0.000001))

In [None]:
polarity_score

array([ 0.18181817,  0.77777775,  0.58762886,  0.99999992,  0.77358489,
        0.50943395,  0.19047619,  0.52380952,  0.73913042, -0.0212766 ,
        0.74193546,  0.01754386,  0.19999998,  0.45454543, -0.2       ,
        0.34999999,  0.36538461,  0.19148936,  0.33333333,  0.47727272,
        0.38888888,  0.3030303 ,  0.36842104,  0.26732673,  0.47999999,
        0.28571428, -0.14285714,  0.43283581,  0.05882353,  0.44827586,
        0.51648351,  0.99999996,  0.390625  ,  0.99999994,  0.39999998,
       -0.49999988, -0.07894737, -0.49473684,  0.0625    ,  0.2       ,
        0.88679244,  0.49999988,  0.42857141,  0.5151515 ,  0.24999999,
        0.9999998 , -0.15254237,  0.63636358,  0.48837208, -0.71428561,
       -0.47058822, -0.2       ,  0.3125    ,  0.19354838,  0.54838708,
        0.72549018,  0.65714284,  0.52380952, -0.40206185,  0.46511627,
       -0.14285714,  0.        , -0.54166666,  0.55555554,  0.14285713,
        0.26530612,  0.26436781,  0.71794871, -0.09523809, -0.35

In [None]:
# The formula divides by the number of words, so use num_of_words
# (len(text) would count characters instead)
subjectivity_score = (positive_score + negative_score)/(num_of_words + 0.000001)

In [None]:
print(f'''\t  Length of texts = {len(texts)}
          Length of negative_score = {len(negative_score)}
          Length of positive_score = {len(positive_score)}
          Length of polarity_score = {len(polarity_score)}
          Length of subjectivity_score = {len(subjectivity_score)}
          Length of complex word count = {len(num_of_complex_word)}
          Length of word_count = {len(num_of_words)}
      ''')

	  Length of texts = 170
          Length of negative_score = 170
          Length of positive_score = 170
          Length of polarity_score = 170
          Length of subjectivity_score = 170
          Length of complex word count = 170
          Length of word_count = 170
      


In [None]:
subjectivity_score 

array([0.00500227, 0.00663717, 0.00833548, 0.00459202, 0.01105779,
       0.00724836, 0.00932091, 0.00740045, 0.00816665, 0.01297626,
       0.00755177, 0.0071161 , 0.00374953, 0.00610264, 0.00978023,
       0.00721111, 0.00844361, 0.01107837, 0.00941029, 0.00907123,
       0.00677902, 0.00859823, 0.00815976, 0.01092601, 0.01186521,
       0.00848164, 0.00897436, 0.00835724, 0.00678643, 0.00888843,
       0.00926114, 0.00776583, 0.01198726, 0.0023549 , 0.00355556,
       0.00328138, 0.01033592, 0.00939106, 0.00720883, 0.0080982 ,
       0.01149425, 0.00149477, 0.00575895, 0.00937766, 0.00852575,
       0.00350877, 0.01232505, 0.00740242, 0.00893971, 0.00748663,
       0.01531532, 0.0070922 , 0.00627882, 0.00697805, 0.00630081,
       0.00629863, 0.01281113, 0.00622407, 0.0162779 , 0.01305404,
       0.00787908, 0.00635104, 0.01593361, 0.00864346, 0.00780814,
       0.01010518, 0.01030317, 0.00360677, 0.00790068, 0.00912316,
       0.0042337 , 0.00623112, 0.00670411, 0.0069785 , 0.00955

### Analysis of Readability

Readability is measured with the Gunning Fog index, using the formulas below. A word counts as complex when it has more than two syllables.

Average Sentence Length = the number of words / the number of sentences

Percentage of Complex words = 100 * (the number of complex words / the number of words)

Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
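To make the arithmetic concrete, here is a small self-contained sketch of the Fog Index on a two-sentence sample. It is an illustration under simplifying assumptions: sentences are split on `.`, `!` and `?` with a regex instead of nltk, and `count_syllables` is a rough vowel-group counter, not the notebook's `syllable_count` heuristic.

```python
import re

def count_syllables(word):
    """Rough estimate: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fog_index(text):
    # Naive sentence and word splitting, used only for this sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    # A word is treated as complex when it has more than two syllables.
    complex_words = [w for w in words if count_syllables(w) > 2]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)

sample = "The cat sat on the mat. Artificial intelligence is everywhere."
print(round(fog_index(sample), 2))  # 14.0
```

The sample has 10 words in 2 sentences (average length 5) and 3 complex words (30 percent), so the index is 0.4 * (5 + 30) = 14.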


In [None]:
num_of_sent = np.array([(len(sent_tokenize(text))) for text in texts])

In [None]:
avg_sent_len = num_of_words/num_of_sent
percent_of_complex_word = num_of_complex_word / num_of_words * 100
fog_index = 0.4*(avg_sent_len + percent_of_complex_word)

In [None]:
fog_index

array([18.04612106, 16.58624263, 17.90710711, 19.78794711, 15.42952183,
       18.98262939, 17.82767231, 17.35058824, 15.32364066, 13.73314974,
       15.67239967, 14.42465875, 14.38018253, 17.13930951, 12.29782917,
       17.34298437, 19.09330062, 13.00902558, 16.41116085, 13.30446643,
       15.82471305, 15.71453267, 13.09690993, 15.70173396, 13.20038792,
       18.10898423, 15.45807389, 17.18483541, 12.71044607, 16.75883424,
       15.0218133 , 17.86248399, 89.83230626, 24.59722559, 19.21126482,
       19.45050779, 16.11919709, 18.26569641, 17.35232533, 13.6103211 ,
       14.72864063, 17.2391034 , 16.67700301, 20.19018134, 12.4924663 ,
       17.81269841, 11.16452347, 15.4583444 , 17.87510549, 15.86959707,
       12.5823823 , 18.35314586, 15.86310063, 13.41081371, 11.17823322,
       14.20999018, 19.28773585, 17.79327741, 10.60352955, 10.4675827 ,
       14.6748266 , 14.92021858, 10.93363153, 16.07464929, 17.51782041,
       20.40512821, 15.66662216, 12.20060388, 14.72238825, 19.18

In [None]:
avg_word_length = [np.mean(item) for item in word_length]

In [None]:
num_of_words

array([ 782,  703, 1998,  479,  858, 1243, 1493, 1496, 1504,  664,  740,
       1471,  472,  635, 1590, 1033, 2010, 1642, 1921, 1809, 1928, 1422,
        845, 1703,  802, 2424, 2193, 1341, 1396, 1800, 1772,  568, 1904,
       1188, 1012,  211, 1265, 1729,  803, 2176,  809,  461,  917,  595,
        485,  224,  951,  251,  790,  182,  419, 1976, 1902, 1702,  979,
       1604,  530, 1811, 1162, 1310, 1140, 1220, 1176,  762,  327,  832,
       1498, 3852, 1915, 1640,  211,  992, 2030, 1965, 1195, 1925, 1129,
       1377, 1172, 1434,  254, 1225,  786, 1207, 1202,  180,  599, 1040,
        809, 1200, 1211, 1155,  920, 1841, 1627,  918, 1993, 1420,  432,
       2409, 1722, 1935, 2007, 1171, 2103,  943, 1492,  339,  586, 1959,
       1947, 1603,  770,  370,  380,  457,  298,  293,  204, 1036,  522,
       2032,  667, 1214,  789,  984, 1261, 1261,  622, 1145, 1114,  912,
       1026,  914, 1383,  877,  431,  858, 1723, 1644,  916, 1323, 1079,
        639, 1643, 1449,  767, 1525, 1742,  981,  5

In [None]:
avg_num_word_per_sent = num_of_words/num_of_sent

In [None]:
pronounRegex = re.compile(r'\b(I|we|my|ours|(?-i:us))\b',re.I)
personal_pronouns = [list(set(pronounRegex.findall(text))) for text in texts]
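The pattern above is case-insensitive except for `us`, which is matched case-sensitively so that the country abbreviation "US" is not counted as a pronoun. The sketch below shows this on a made-up sentence; the sample text and the name `pronoun_regex` are assumptions for illustration.

```python
import re

# Same pattern as in the cell above: re.I applies everywhere except inside (?-i:...).
pronoun_regex = re.compile(r'\b(I|we|my|ours|(?-i:us))\b', re.I)

sample = "We met in the US, and they told us that my plan works."
matches = pronoun_regex.findall(sample)
print(matches)       # ['We', 'us', 'my']
print(len(matches))  # 3
```

Note that the notebook keeps the set of distinct matched forms per article; if a count of pronouns is wanted instead, `len(pronoun_regex.findall(text))` is one option.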

In [None]:
header = ['URL_ID',
          'URL',
          'POSITIVE SCORE',
          'NEGATIVE SCORE',
          'POLARITY SCORE',
          'SUBJECTIVITY SCORE',
          'AVG SENTENCE LENGTH',
          'PERCENTAGE OF COMPLEX WORDS',
          'FOG INDEX',
          'AVG NUMBER OF WORDS PER SENTENCE',
          'COMPLEX WORD COUNT',
          'WORD COUNT',
          'SYLLABLE PER WORD',
          'PERSONAL PRONOUNS',
          'AVG WORD LENGTH'
          ]

In [None]:
filename = 'Text_analysis_project.csv'

In [None]:
rows = []
for i in range(len(texts)):
    row = [urls[i][0] , 
           urls[i][1],
           positive_score[i],
           negative_score[i],
           polarity_score[i],
           subjectivity_score[i],
           avg_sent_len[i],
           percent_of_complex_word[i],
           fog_index[i],
           avg_num_word_per_sent[i],
           num_of_complex_word[i],
           num_of_words[i],
           num_of_syllable[i],
           personal_pronouns[i],
           avg_word_length[i]]
    rows.append(row)      

In [None]:
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(filename, mode='w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(header)
    csvwriter.writerows(rows)

In [None]:
df = pd.read_csv(filename)

In [None]:
df.head()

| | URL_ID | URL | POSITIVE SCORE | NEGATIVE SCORE | POLARITY SCORE | SUBJECTIVITY SCORE | AVG SENTENCE LENGTH | PERCENTAGE OF COMPLEX WORDS | FOG INDEX | AVG NUMBER OF WORDS PER SENTENCE | COMPLEX WORD COUNT | WORD COUNT | SYLLABLE PER WORD | PERSONAL PRONOUNS | AVG WORD LENGTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | https://insights.blackcoffer.com/how-is-login-... | 13 | 9 | 0.181818 | 0.005002 | 32.583333 | 12.531969 | 18.046121 | 32.583333 | 98 | 782 | [1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, ... | ['We', 'we'] | 4.703325 |
| 1 | 2.0 | https://insights.blackcoffer.com/how-does-ai-h... | 24 | 3 | 0.777778 | 0.006637 | 25.107143 | 16.358464 | 16.586243 | 25.107143 | 115 | 703 | [1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 3, 2, 1, 3, ... | ['we'] | 4.880512 |
| 2 | 3.0 | https://insights.blackcoffer.com/ai-and-its-im... | 77 | 20 | 0.587629 | 0.008335 | 27.0 | 17.767768 | 17.907107 | 27.0 | 355 | 1998 | [1, 1, 1, 2, 1, 1, 2, 3, 1, 3, 2, 1, 1, 1, 1, ... | ['We', 'we', 'us'] | 4.914915 |
| 3 | 4.0 | https://insights.blackcoffer.com/how-do-deep-l... | 13 | 0 | 1.0 | 0.004592 | 31.933333 | 17.536534 | 19.787947 | 31.933333 | 84 | 479 | [1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, ... | ['we'] | 4.974948 |
| 4 | 5.0 | https://insights.blackcoffer.com/how-artificia... | 47 | 6 | 0.773585 | 0.011058 | 23.189189 | 15.384615 | 15.429522 | 23.189189 | 132 | 858 | [1, 4, 4, 1, 1, 1, 5, 2, 1, 1, 3, 2, 1, 1, 1, ... | ['I', 'we', 'We', 'my', 'us'] | 4.634033 |
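The SYLLABLE PER WORD and PERSONAL PRONOUNS columns hold Python lists, so the CSV stores only their string form and reading the file back yields strings, not lists. The sketch below shows one possible way to recover them with `ast.literal_eval`; it uses an in-memory buffer and a hypothetical two-column row instead of `Text_analysis_project.csv`.

```python
import ast
import csv
import io

# Hypothetical row mimicking the two list-valued columns of the output file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["SYLLABLE PER WORD", "PERSONAL PRONOUNS"])
writer.writerow([[1, 1, 2], ["We", "we"]])

# Read the row back: every field comes back as a string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["SYLLABLE PER WORD"]

# literal_eval safely parses the string form of a Python literal.
syllables = ast.literal_eval(raw)
pronouns = ast.literal_eval(row["PERSONAL PRONOUNS"])

print(type(raw).__name__, syllables, pronouns)
print(sum(syllables) / len(syllables))  # average syllables per word for this row
```

If a single number per article is preferred, writing the average syllables per word (or the pronoun count) to the CSV instead of the full list would avoid the parsing step altogether.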
