<a href="https://colab.research.google.com/github/rushithakondreddy/haihua_INFO5731_Spring2020/blob/main/In_class_exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The fifth In-class-exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collected 500 textual data based on your own information needs (If you didn't collect the textual data, you should recollect for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 types of features but absolutely more than 10 features) in the followings feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated using the ratio of folowing function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features in your mind (ie., pre-defined patterns)

In [13]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
import nltk
from nltk.tokenize import word_tokenize

In [14]:
text = []
first_page = 'https://citeseerx.ist.psu.edu/search;jsessionid=37A0AC54277865A394D5F96748080D17?q=natural+language+processing&t=doc&sort=date'
for page_num in range(50):
  if page_num == 0:
    website_link = first_page
  else:
    website_link = 'https://citeseerx.ist.psu.edu/search?q=natural+language+processing&t=doc&sort=date&start=' + str (page_num)
  page = requests.get(website_link)
  soup = BeautifulSoup(page.text, 'html.parser')
  abstracts = soup.find_all(class_='pubabstract')
  page_num = page_num + 10
  for abstract in abstracts:
    processed_text = abstract.text.replace('\n', '').strip()
    text.append(processed_text)
data_frame = pd.DataFrame((text), columns =['NLP Text'])
data_frame

Unnamed: 0,NLP Text
0,to process knowledge stored in distributed het...
1,Abstract: The Unified Modeling Language (UML) ...
2,-fluent approach to conflict means working ove...
3,the critical nature of CMP as described in the...
4,strength. Numerous studies are present that sh...
...,...
495,FD-buffer:\t a\tbuffer\tmanager\tfor\tdatabase...
496,ABSTRACT The adsorption kinetics of pure N 2 O...
497,Abstract Despite low attention level in Wester...
498,Abstract Networked learning happens naturally ...


In [15]:
data_frame['Special Characters Removal'] = data_frame['NLP Text'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', i) for i in x ))
#Removal of StopWords
stop = stopwords.words('english')
data_frame['removed stopwords'] = data_frame['NLP Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
print(data_frame)
#Stemming Data
st = PorterStemmer()
data_frame['Stemming'] = data_frame['removed stopwords'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
#Lemmatization Data
nltk.download('wordnet')
data_frame['Lemmatization'] = data_frame['NLP Text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
data_frame

                                              NLP Text  ...                                  removed stopwords
0    to process knowledge stored in distributed het...  ...  process knowledge stored distributed heterogen...
1    Abstract: The Unified Modeling Language (UML) ...  ...  Abstract: The Unified Modeling Language (UML) ...
2    -fluent approach to conflict means working ove...  ...  -fluent approach conflict means working time u...
3    the critical nature of CMP as described in the...  ...  critical nature CMP described brief overview a...
4    strength. Numerous studies are present that sh...  ...  strength. Numerous studies present show proces...
..                                                 ...  ...                                                ...
495  FD-buffer:\t a\tbuffer\tmanager\tfor\tdatabase...  ...   FD-buffer: buffer manager databases flash disks.
496  ABSTRACT The adsorption kinetics of pure N 2 O...  ...  ABSTRACT The adsorption kinetics pure N 2 O na...
4

Unnamed: 0,NLP Text,Special Characters Removal,removed stopwords,Stemming,Lemmatization
0,to process knowledge stored in distributed het...,to process knowledge stored in distributed het...,process knowledge stored distributed heterogen...,process knowledg store distribut heterogen sou...,to process knowledge stored in distributed het...
1,Abstract: The Unified Modeling Language (UML) ...,Abstract The Unified Modeling Language UML ...,Abstract: The Unified Modeling Language (UML) ...,abstract: the unifi model languag (uml) wide u...,Abstract: The Unified Modeling Language (UML) ...
2,-fluent approach to conflict means working ove...,fluent approach to conflict means working ove...,-fluent approach conflict means working time u...,-fluent approach conflict mean work time under...,-fluent approach to conflict mean working over...
3,the critical nature of CMP as described in the...,the critical nature of CMP as described in the...,critical nature CMP described brief overview a...,"critic natur cmp describ brief overview above,...",the critical nature of CMP a described in the ...
4,strength. Numerous studies are present that sh...,strength Numerous studies are present that sh...,strength. Numerous studies present show proces...,strength. numer studi present show process com...,strength. Numerous study are present that show...
...,...,...,...,...,...
495,FD-buffer:\t a\tbuffer\tmanager\tfor\tdatabase...,FD buffer a buffer manager for databases on ...,FD-buffer: buffer manager databases flash disks.,fd-buffer: buffer manag databas flash disks.,FD-buffer: a buffer manager for database on fl...
496,ABSTRACT The adsorption kinetics of pure N 2 O...,ABSTRACT The adsorption kinetics of pure N 2 O...,ABSTRACT The adsorption kinetics pure N 2 O na...,abstract the adsorpt kinet pure N 2 O natur ze...,ABSTRACT The adsorption kinetics of pure N 2 O...
497,Abstract Despite low attention level in Wester...,Abstract Despite low attention level in Wester...,Abstract Despite low attention level Western m...,abstract despit low attent level western media...,Abstract Despite low attention level in Wester...
498,Abstract Networked learning happens naturally ...,Abstract Networked learning happens naturally ...,Abstract Networked learning happens naturally ...,abstract network learn happen natur within soc...,Abstract Networked learning happens naturally ...


In [16]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package brown to /root/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package brown_tei to /root/nltk_data...
       |   Unzipping corpora/brown_tei.zip.
       | Downloading package cess_cat to /root/nltk_data...
       |   Unzipping corpora/cess_cat.zip.
       | Downloading package

True

In [17]:
parts_of_speech = []
for sentence in data_frame['NLP Text']:
  parts_of_speech.append(nltk.pos_tag(word_tokenize(sentence)))

In [18]:
dict_counts = {}
Final_dict = {}
Final_dict['Cordinating Conjuction'] = []
Final_dict['Determiner'] = []
Final_dict['Modal'] = []
Final_dict['Preposition'] = []
Final_dict['Adjective'] = []
Final_dict['Proper Noun'] = []
Final_dict['Verb'] = []
Final_dict['Possessive Ending'] = []
Final_dict['Possesive Pronoun'] = []
for i in parts_of_speech:
  dict_counts['CC'] = dict_counts['DT'] = dict_counts['MD'] = dict_counts['IN'] = dict_counts['JJS'] = dict_counts['NNP'] = dict_counts['VB'] = dict_counts['POS'] = dict_counts['PRP$'] = 0
  for j in i:
    if j[1] == 'CC':
      dict_counts['CC'] += 1
    elif j[1] == 'DT':
      dict_counts['DT'] += 1
    elif j[1] == 'MD':
      dict_counts['MD'] += 1
    elif j[1] == 'IN':
      dict_counts['IN'] += 1
    elif j[1] == 'JJS':
      dict_counts['JJS'] += 1
    elif j[1] == 'NNP':
      dict_counts['NNP'] += 1
    elif j[1] == 'VB':
      dict_counts['VB'] += 1
    elif j[1] == 'POS':
      dict_counts['POS'] += 1
    elif j[1] == 'PRP$':
      dict_counts['PRP$'] += 1
  Final_dict['Cordinating Conjuction'].append(dict_counts['CC'])
  Final_dict['Determiner'].append(dict_counts['DT'])
  Final_dict['Modal'].append(dict_counts['MD'])
  Final_dict['Preposition'].append(dict_counts['IN'])
  Final_dict['Adjective'].append(dict_counts['JJS'])
  Final_dict['Proper Noun'].append(dict_counts['NNP'])
  Final_dict['Verb'].append(dict_counts['VB'])
  Final_dict['Possessive Ending'].append(dict_counts['POS'])
  Final_dict['Possesive Pronoun'].append(dict_counts['PRP$'])

In [19]:
features_dataframe = pd.DataFrame.from_dict(Final_dict)
print(features_dataframe)

     Cordinating Conjuction  Determiner  ...  Possessive Ending  Possesive Pronoun
0                         1           3  ...                  0                  0
1                         1           6  ...                  0                  0
2                         2           3  ...                  0                  1
3                         0           6  ...                  0                  0
4                         1           3  ...                  0                  0
..                      ...         ...  ...                ...                ...
495                       0           1  ...                  0                  0
496                       2           6  ...                  0                  0
497                       1           2  ...                  0                  0
498                       0           4  ...                  0                  0
499                       1           7  ...                  0                  0

[50

In [20]:
features_dataframe.insert (0, "Text", data_frame['NLP Text'])
features_dataframe['Sentence Length'] = features_dataframe['Text'].apply(lambda x: len(x))
features_dataframe

Unnamed: 0,Text,Cordinating Conjuction,Determiner,Modal,Preposition,Adjective,Proper Noun,Verb,Possessive Ending,Possesive Pronoun,Sentence Length
0,to process knowledge stored in distributed het...,1,3,1,5,0,0,3,0,0,305
1,Abstract: The Unified Modeling Language (UML) ...,1,6,0,5,0,8,2,0,0,297
2,-fluent approach to conflict means working ove...,2,3,0,5,0,2,3,0,1,311
3,the critical nature of CMP as described in the...,0,6,0,9,0,3,1,0,0,292
4,strength. Numerous studies are present that sh...,1,3,0,8,0,0,1,0,0,303
...,...,...,...,...,...,...,...,...,...,...,...
495,FD-buffer:\t a\tbuffer\tmanager\tfor\tdatabase...,0,1,0,2,0,0,0,0,0,58
496,ABSTRACT The adsorption kinetics of pure N 2 O...,2,6,0,8,0,10,0,0,0,293
497,Abstract Despite low attention level in Wester...,1,2,0,6,0,2,1,0,0,291
498,Abstract Networked learning happens naturally ...,0,4,2,6,0,2,4,0,0,296
