## <font color='green'> Application of Random Forest and Boosted Trees to the Classification of Web Documents
* Introduction to ensemble methods in Python (scikit-learn): https://scikit-learn.org/stable/modules/ensemble.html#ensemble

In [None]:
import os
os.chdir('/Users/hj020/Desktop/2022/EconomicAnalytics-master/Python_/Data')

import numpy as np
import pandas as pd
import math

# Data preparation: 20 Newsgroups data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer 

categories = ['alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space']

remove = ('headers', 'footers', 'quotes')

data_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove, shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove, shuffle=True, random_state=42)

Y_train, Y_test = data_train.target, data_test.target

X_train = data_train.data
X_test = data_test.data

vectorizer = TfidfVectorizer(stop_words='english')

X_train = vectorizer.fit_transform(X_train) 
X_test = vectorizer.transform(X_test)
n_features = X_train.shape[1]
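
The cell above fits the vectorizer on the training documents only and then maps the test documents onto that same vocabulary. A minimal sketch of that pattern on a made-up corpus follows; the toy sentences and the `toy_` variable names are illustrative assumptions, not part of the newsgroup data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate fit_transform vs. transform
toy_train = [
    "the rocket reached orbit",
    "the shuttle left orbit",
    "pixels and polygons",
]
toy_test = ["a rocket made of polygons and zebras"]  # "zebras" never appears in training

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_X_train = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary and IDF weights
toy_X_test = toy_vectorizer.transform(toy_test)        # reuses them; unseen words are ignored

print(toy_X_train.shape, toy_X_test.shape)     # both matrices have the same number of columns
print("zebras" in toy_vectorizer.vocabulary_)  # False
```

Because the test matrix shares the training vocabulary, a classifier fitted on the training matrix can be applied to it directly.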

In [None]:
data_test.target_names

### <font color='green'> 1) Random Forests
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

RF = RandomForestClassifier(n_estimators=1000, max_depth=None, max_features = math.floor(math.sqrt(n_features)), min_samples_split = 2)

RFres= RF.fit(X_train, Y_train)

print(RFres.score(X_test, Y_test))
print(classification_report(Y_test, RFres.predict(X_test)))

### <font color='green'> * Relative Influence Plot
* An example: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

#### <font color='green'> i) Extract 10 most important features (words) in terms of their importance scores

In [None]:
importances = RFres.feature_importances_ # importance scores for all the words used for classification

# Take the indices of the 10 largest importance scores, largest first.
# np.argsort only sorts in ascending order, so keep the last 10 indices and reverse them
# (np.argsort(-importances)[:10] is an equivalent shortcut).
top10n = np.argsort(importances)[-10:][::-1] # Original indices of the 10 largest scores
top10v = importances[top10n] # Scores looked up by index, so they stay aligned with the words

# Words used for classification by the vectorizer
# (get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names())
feature_names = np.asarray(vectorizer.get_feature_names_out())

print(feature_names[top10n]) # Sort feature names by the sorted order above
print(top10v)
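
The index-based selection used above can be checked on a small example. This is a sketch with made-up scores and words (both hypothetical), showing that looking values up through the `argsort` indices keeps names and scores aligned.

```python
import numpy as np

# Hypothetical importance scores for five made-up words
scores = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
words = np.array(["moon", "god", "file", "space", "image"])

k = 3
top_idx = np.argsort(scores)[-k:][::-1]  # indices of the k largest scores, largest first
top_scores = scores[top_idx]             # values looked up by index, so they match the words

print(words[top_idx].tolist())  # ['god', 'space', 'image']
print(top_scores.tolist())      # [0.4, 0.3, 0.15]
```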

#### <font color='green'> ii) Plot a bar chart

In [None]:
import matplotlib.pyplot as plt

for f in range(10):
    print("%d. %s (%f)" % (f + 1, feature_names[top10n[f]], top10v[f]))

plt.figure(figsize=(15,7))
plt.title("Feature Importance Chart",  fontsize=20)
plt.bar(feature_names[top10n], top10v, color="r")
plt.xticks(fontsize= 15)
plt.show()

### <font color='green'> 2) Boosted Trees using Gradient Boosting
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
    - $B$ : n_estimators, default=100
    - $\lambda$ : learning_rate, default=0.1
    - Tree size: max_depth (the maximum depth of each tree; a tree of depth $d$ has at most $2^d - 1$ splits), default=3
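
To see how these three parameters enter the estimator, here is a minimal sketch on synthetic data. The data set, the `_demo` names, and the parameter values are assumptions chosen for speed, not the settings used in this notebook. `staged_predict` yields the predictions after each additional tree, so a single fit shows how accuracy evolves with $B$.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the newsgroup data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

gb_demo = GradientBoostingClassifier(
    n_estimators=50,    # B: number of trees
    learning_rate=0.1,  # lambda: shrinkage applied to each tree's contribution
    max_depth=2,        # maximum depth of each tree
    random_state=0,
)
gb_demo.fit(Xd_train, yd_train)

# staged_predict yields test-set predictions after 1, 2, ..., B trees
staged_acc = [accuracy_score(yd_test, pred) for pred in gb_demo.staged_predict(Xd_test)]
print(len(staged_acc), staged_acc[-1])
```

The last entry of `staged_acc` matches `gb_demo.score(Xd_test, yd_test)`, since both use all 50 trees.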

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

GB = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.5, max_depth=2)

GBres = GB.fit(X_train, Y_train)

print(GBres.score(X_test, Y_test))
print(classification_report(Y_test, GBres.predict(X_test)))

### <font color='darkred'> HW8: Following Figure 8.11 on p. 324 of the textbook, plot out-of-sample classification accuracy against the number of trees for random forest and gradient boosting
- You may use the GridSearchCV function for this exercise. In this case, set scoring = 'accuracy' (note that GridSearchCV reports cross-validated accuracy rather than accuracy on the test set)
- You may also write a for-loop manually. In this case, use the .score() method to compute out-of-sample classification accuracy
- You can use any set of values for the number of trees (e.g. n_estimators=[10, 50, 100, 500, 1000, ...]), but you may want to use a fine grid to produce a better plot
- Put number of trees on x-axis and classification accuracy on y-axis
- Set max_depth=None, max_features = math.floor(math.sqrt(n_features)), min_samples_split = 2 for random forest
- Set learning_rate=0.5, max_depth=2 for gradient boosting

## <font color='green'> Web Scraping
* Web scraping using BeautifulSoup: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
* Web scraping using Scrapy: https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/
* Web scrapers: https://www.scraperapi.com/blog/the-10-best-web-scraping-tools (including commercial scrapers)

In [None]:
import requests
from bs4 import BeautifulSoup

# Specify a url
quote_page = 'https://www.uark.edu/academics/majors.php'
    
response = requests.get(quote_page)

# Parse the HTML of the downloaded page into a BeautifulSoup object
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
# Find the tag and class of the part you want to scrape using the "inspect" function in your web browser
# For this, the inspection function may need to be manually enabled in your browser
# Plug the tag and class into "find"

name_box = soup.find('p', {'class' : 'bigCopy'})
name_box
name = name_box.text.strip() # strip() is used to remove spaces at the beginning and at the end of the string
print(name)
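
`find` returns `None` when no matching tag exists, which makes `.text` fail if the page layout changes. A minimal sketch of a guarded lookup follows, run on a made-up HTML string rather than a live page; the snippet and the names `html`, `demo_soup`, `box`, and `text` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a downloaded page
html = """
<html><body>
  <p class="intro">Welcome</p>
  <p class="bigCopy">   Choose from many majors.   </p>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")
box = demo_soup.find('p', {'class': 'bigCopy'})  # None if no such tag exists

if box is None:
    text = ""  # fall back to an empty string instead of raising an error
else:
    text = box.text.strip()  # drop leading and trailing whitespace

print(text)  # Choose from many majors.
```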

In [None]:
# vectorize the text
ark = [name] 
arkk=vectorizer.transform(ark)
print(arkk)

In [None]:
RFres.predict(arkk)
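
`predict` returns an integer class index, not a category name. The sketch below maps an index back to its label; the label order and the predicted value are assumptions for illustration, and in this notebook the order should be read from `data_train.target_names`. Note also that the classifier always returns one of the four categories it was trained on, even for text unrelated to any of them.

```python
import numpy as np

# Assumed label order; in the notebook, read it from data_train.target_names instead
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

pred = np.array([1])  # hypothetical output of RFres.predict(arkk)
labels = [target_names[i] for i in pred]  # map each class index to its category name
print(labels)  # ['comp.graphics']
```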