## The fifth in-class exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collect 500 texts based on your own information needs (if you didn't collect them, you should recollect the data for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features from the feature list below (10 types of features, which will yield far more than 10 individual features), then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density of determiners/quantifiers (density can be calculated as the ratio of the following function words to content words)
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
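
As an illustration of feature type (1), here is a minimal tf-idf sketch in plain Python. It assumes whitespace tokenization and the unsmoothed weighting tf × log(N / df); the function name `tfidf_features` and the toy texts are hypothetical and not part of the exercise. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed idf and L2 normalization by default, so their values will differ from these.

```python
import math
from collections import Counter

def tfidf_features(documents):
    """Return (vocabulary, rows); rows[i][j] is the tf-idf weight of term j in document i."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)
    n_docs = len(tokenized)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        # tf = relative frequency in the document; idf = log(N / df).
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows

# Hypothetical toy texts standing in for the 500 collected abstracts.
sample_docs = ["language models process text", "text classification needs features"]
vocabulary, rows = tfidf_features(sample_docs)
```

Each row is one text and each column one vocabulary term, which matches the one-row-per-text layout the exercise asks for. A term that occurs in every text (here `text`) gets weight 0 under this variant.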

In [10]:
# Please write your code here
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
import nltk
from nltk.tokenize import word_tokenize

In [11]:
text = []
first_page = 'https://citeseerx.ist.psu.edu/search;jsessionid=37A0AC54277865A394D5F96748080D17?q=natural+language+processing&t=doc&sort=date'
for page_num in range(50):
  if page_num == 0:
    website_link = first_page
  else:
    # Offset of the first result on this page; 10 results per page is assumed from the original +10 step.
    website_link = 'https://citeseerx.ist.psu.edu/search?q=natural+language+processing&t=doc&sort=date&start=' + str(page_num * 10)
  page = requests.get(website_link)
  soup = BeautifulSoup(page.text, 'html.parser')
  abstracts = soup.find_all(class_='pubabstract')
  for abstract in abstracts:
    # Flatten each abstract to a single line of text.
    processed_text = abstract.text.replace('\n', '').strip()
    text.append(processed_text)
data_frame = pd.DataFrame(text, columns=['NLP Text'])
data_frame

Unnamed: 0,NLP Text
0,to process knowledge stored in distributed het...
1,Abstract: The Unified Modeling Language (UML) ...
2,-fluent approach to conflict means working ove...
3,the critical nature of CMP as described in the...
4,strength. Numerous studies are present that sh...
...,...
495,FD-buffer:\t a\tbuffer\tmanager\tfor\tdatabase...
496,ABSTRACT The adsorption kinetics of pure N 2 O...
497,Abstract Despite low attention level in Wester...
498,Abstract Networked learning happens naturally ...


In [None]:
nltk.download('stopwords')
# Replace every run of non-alphanumeric characters with a single space
data_frame['Special Characters Removal'] = data_frame['NLP Text'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]+", ' ', x).strip())
# Removal of stop words (case-insensitive), applied to the cleaned text rather than the raw text
stop = set(stopwords.words('english'))
data_frame['removed stopwords'] = data_frame['Special Characters Removal'].apply(lambda x: " ".join(word for word in x.split() if word.lower() not in stop))
print(data_frame)
# Stemming
st = PorterStemmer()
data_frame['Stemming'] = data_frame['removed stopwords'].apply(lambda x: " ".join(st.stem(word) for word in x.split()))
# Lemmatization (needs the WordNet corpus)
nltk.download('wordnet')
data_frame['Lemmatization'] = data_frame['removed stopwords'].apply(lambda x: " ".join(Word(word).lemmatize() for word in x.split()))
data_frame

In [None]:
import nltk
# Download only the resources the tagging cell needs instead of opening the interactive downloader.
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

In [None]:
parts_of_speech = []
for sentence in data_frame['NLP Text']:
  parts_of_speech.append(nltk.pos_tag(word_tokenize(sentence)))
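
The tagged output can also feed the density features from the list (function words relative to content words). The following is a rough sketch, not a definitive implementation: the grouping of Penn Treebank tags, the feature names, and the helper `density_features` are assumptions made for illustration. In particular, `IN` also covers subordinating conjunctions, and `MD` covers modals only, so the auxiliary density is an approximation.

```python
# Penn Treebank tag groups (assumed grouping for this sketch).
CONTENT_PREFIXES = ('NN', 'VB', 'JJ', 'RB')  # nouns, verbs, adjectives, adverbs
FUNCTION_TAGS = {
    'determiner_density': {'DT', 'PDT', 'WDT'},
    'pronoun_density': {'PRP', 'PRP$', 'WP', 'WP$'},
    'preposition_density': {'IN'},   # IN also tags subordinating conjunctions
    'conjunction_density': {'CC'},
    'auxiliary_density': {'MD'},     # modals only; be/have/do are tagged VB*
}

def density_features(tagged_tokens):
    """Ratio of each function-word group to the number of content words in one text."""
    content = sum(1 for _, tag in tagged_tokens if tag.startswith(CONTENT_PREFIXES))
    features = {}
    for name, tags in FUNCTION_TAGS.items():
        count = sum(1 for _, tag in tagged_tokens if tag in tags)
        # Texts without content words get density 0.0 instead of raising an error.
        features[name] = count / content if content else 0.0
    return features

# Hand-tagged toy example (hypothetical); each element of parts_of_speech has this shape.
sample = [('The', 'DT'), ('parser', 'NN'), ('reads', 'VBZ'), ('a', 'DT'),
          ('sentence', 'NN'), ('in', 'IN'), ('it', 'PRP')]
densities = density_features(sample)
```

Applied to the notebook data, `[density_features(tagged) for tagged in parts_of_speech]` would give one dictionary per text, which `pd.DataFrame` can turn into feature columns.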

In [None]:
from collections import Counter

# Map each feature column to the Penn Treebank tags it counts.
# The original counted only 'JJS', 'NNP' and 'VB'; the groups below cover every
# adjective, proper-noun and verb tag so the column names match their contents.
tag_groups = {
  'Coordinating Conjunction': ('CC',),
  'Determiner': ('DT',),
  'Modal': ('MD',),
  'Preposition': ('IN',),
  'Adjective': ('JJ', 'JJR', 'JJS'),
  'Proper Noun': ('NNP', 'NNPS'),
  'Verb': ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'),
  'Possessive Ending': ('POS',),
  'Possessive Pronoun': ('PRP$',),
}
Final_dict = {name: [] for name in tag_groups}
for tagged_text in parts_of_speech:
  # Count every tag once per text, then sum the counts within each group.
  tag_counts = Counter(tag for _, tag in tagged_text)
  for name, tags in tag_groups.items():
    Final_dict[name].append(sum(tag_counts[t] for t in tags))

In [None]:
features_dataframe = pd.DataFrame.from_dict(Final_dict)
print(features_dataframe)
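
The POS counts above amount to a single feature type. As one possible step toward the ten types the exercise asks for, sentence length from the linguistic feature list can be approximated without a parser. This sketch assumes that a sentence ends at `.`, `!` or `?` followed by whitespace and that tokens are whitespace-separated; the helper `sentence_length_features` and the sample text are hypothetical. A trained splitter such as `nltk.sent_tokenize` would handle abbreviations better.

```python
import re

def sentence_length_features(text):
    """Return sentence count plus average and maximal sentence length in whitespace tokens."""
    # Naive splitter (assumption): break after '.', '!' or '?' when whitespace follows.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {'sentence_count': 0, 'avg_sentence_length': 0.0, 'max_sentence_length': 0}
    return {
        'sentence_count': len(lengths),
        'avg_sentence_length': sum(lengths) / len(lengths),
        'max_sentence_length': max(lengths),
    }

# Hypothetical sample text standing in for one collected abstract.
sample_text = "We parse the text. Then we count tokens in every sentence!"
lengths = sentence_length_features(sample_text)
```

In the notebook, `pd.DataFrame([sentence_length_features(t) for t in data_frame['NLP Text']])` would produce three columns that can be joined to `features_dataframe` with `pd.concat(..., axis=1)`.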