## Random Forest Classification 
### Mai Yang

In [1]:
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Dataset: Lung Cancer Prediction

Source: https://www.kaggle.com/datasets/thedevastator/cancer-patients-and-air-pollution-a-new-link

The study was published in the journal Nature Medicine and drew on data from over 462,000 people in China who were followed for an average of six years. The participants were divided into two groups: those who lived in areas with high levels of air pollution and those who lived in areas with low levels of air pollution.

Description: Lung cancer is the leading cause of cancer death worldwide, accounting for 1.59 million deaths in 2018. The majority of lung cancer cases are attributed to smoking, but exposure to air pollution and other factors also raises the risk. A new study has found that air pollution may be linked to an increased risk of lung cancer, even in nonsmokers. In this dataset, each patient is assigned one of three lung cancer levels: 0: low, 1: medium, or 2: high.


Variables/Columns

* Index
* Patient Id
* Age
* Gender
* Air Pollution
* Alcohol use
* Dust Allergy
* Occupational Hazards
* Genetic Risk
* Chronic Lung Disease
* Balanced Diet
* Obesity
* Smoking
* Passive Smoker
* Chest Pain
* Coughing of Blood
* Fatigue
* Weight Loss
* Shortness of Breath
* Wheezing
* Swallowing Difficulty
* Clubbing of Finger Nails
* Frequent Cold
* Dry Cough
* Snoring
* Level:
    * 0: Low
    * 1: Medium
    * 2: High
   
    




In [2]:
# Load and read the dataset to a dataframe
df = pd.read_csv('Resource/cancer patient data sets.csv')
df.head()

Unnamed: 0,index,Patient Id,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,...,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level
0,0,P1,33,1,2,4,5,4,3,2,...,3,4,2,2,3,1,2,3,4,Low
1,1,P10,17,1,3,1,5,3,4,2,...,1,3,7,8,6,2,1,7,2,Medium
2,2,P100,35,1,4,5,6,5,5,4,...,8,7,9,2,1,4,6,7,2,High
3,3,P1000,37,1,7,7,7,7,6,7,...,4,2,3,1,4,5,6,7,5,High
4,4,P101,46,1,6,8,7,7,7,6,...,3,2,4,1,4,2,4,2,3,High


In [3]:
# Drop the index and Patient Id columns: they are identifiers and carry no predictive information
# (to drop a single column instead: df.drop('column_name', axis=1))
df = df.drop(columns=['index', 'Patient Id'])

In [4]:
# Separate the features (X) from the target (y); Level is dropped from X because it is the label we want to predict
X = df.drop('Level', axis=1)
y = df['Level']
target_names = ["Low", "Medium", "High"]

In [10]:
df.head(26)

Unnamed: 0,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,Balanced Diet,Obesity,...,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level
0,33,1,2,4,5,4,3,2,2,4,...,3,4,2,2,3,1,2,3,4,Low
1,17,1,3,1,5,3,4,2,2,2,...,1,3,7,8,6,2,1,7,2,Medium
2,35,1,4,5,6,5,5,4,6,7,...,8,7,9,2,1,4,6,7,2,High
3,37,1,7,7,7,7,6,7,7,7,...,4,2,3,1,4,5,6,7,5,High
4,46,1,6,8,7,7,7,6,7,7,...,3,2,4,1,4,2,4,2,3,High
5,35,1,4,5,6,5,5,4,6,7,...,8,7,9,2,1,4,6,7,2,High
6,52,2,2,4,5,4,3,2,2,4,...,3,4,2,2,3,1,2,3,4,Low
7,28,2,3,1,4,3,2,3,4,3,...,3,2,2,4,2,2,3,4,3,Low
8,35,2,4,5,6,5,6,5,5,5,...,1,4,3,2,4,6,2,4,1,Medium
9,46,1,2,3,4,2,4,3,3,3,...,1,2,4,6,5,4,2,1,5,Medium


In [6]:
# Prepare the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
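
The scaler above is fitted on the training split only and then applied to both splits, so no test-set statistics leak into preprocessing. A minimal sketch of the same pattern on synthetic data follows. The `make_classification` data, the `stratify` argument, and every `_demo` name are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the patient table (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=1
)

# stratify keeps the class proportions similar in both splits;
# the default test_size is 0.25
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

# Fit on the training split only, then transform both splits
scaler_demo = StandardScaler().fit(Xtr_demo)
Xtr_scaled_demo = scaler_demo.transform(Xtr_demo)
Xte_scaled_demo = scaler_demo.transform(Xte_demo)

# Training features have mean 0 and standard deviation 1 by construction;
# test features are only approximately standardized
print(Xtr_scaled_demo.mean(axis=0).round(6))
print(Xtr_scaled_demo.std(axis=0).round(6))
print(Xte_scaled_demo.mean(axis=0).round(2))
```

Tree-based models such as random forests are insensitive to feature scaling, so this step mainly matters if a scale-sensitive model is swapped in later.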

In [7]:
# Import a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

In [8]:
# Import an Extra Trees (extremely randomized trees) classifier
from sklearn.ensemble import ExtraTreesClassifier
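
`ExtraTreesClassifier` is imported here but never fitted in this notebook. As a rough sketch of how it could be compared with the random forest, the block below runs 5-fold cross-validation for both models. It uses synthetic data from `make_classification`; the data, the `n_estimators=50` setting, and the `_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for the patient table (assumption)
X_cv_demo, y_cv_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=1
)

models_demo = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=1),
}

# Mean accuracy over 5 folds for each model
cv_means_demo = {}
for name, model in models_demo.items():
    scores = cross_val_score(model, X_cv_demo, y_cv_demo, cv=5)
    cv_means_demo[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Cross-validation averages over several splits, so it gives a steadier comparison than a single train/test split.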

In [9]:
# Fit a model, and then print a classification report
clf = RandomForestClassifier(random_state=1).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
# Pass labels explicitly: by default the report rows follow sorted label order
# (High, Low, Medium), so target_names alone would be attached to the wrong classes
print(classification_report(y_test, y_pred, labels=target_names, target_names=target_names))
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

              precision    recall  f1-score   support

         Low       1.00      1.00      1.00        78
      Medium       1.00      1.00      1.00        78
        High       1.00      1.00      1.00        94

    accuracy                           1.00       250
   macro avg       1.00      1.00      1.00       250
weighted avg       1.00      1.00      1.00       250

Training Score: 1.0
Testing Score: 1.0
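
The `confusion_matrix` function imported in the first cell is not used in this notebook. The sketch below shows how it could complement the report, and how the `labels` argument controls row order. Without `labels`, scikit-learn sorts string labels alphabetically (High, Low, Medium). The labels here are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration only; not taken from the dataset
y_true_demo = ["Low", "Low", "Medium", "High", "High", "Medium"]
y_pred_demo = ["Low", "Medium", "Medium", "High", "High", "Medium"]

label_order = ["Low", "Medium", "High"]

# Rows are true classes and columns are predicted classes, both in label_order
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=label_order)
print(cm_demo)

# The same explicit order keeps the report rows aligned with their names
report_demo = classification_report(
    y_true_demo,
    y_pred_demo,
    labels=label_order,
    target_names=label_order,
    output_dict=True,
)
print(report_demo["Low"]["recall"])
```

Off-diagonal entries show which classes are confused with each other, which the per-class scores alone do not reveal.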


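### The classification report
The report shows precision, recall, and f1-score of 1.00 for all three levels, and both the training and testing scores are 1.0 (100%), a perfect score. The Random Forest Classifier therefore fits this dataset very well, although no other model was evaluated here, so the result does not by itself show that it is the best choice.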
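
A perfect score on both splits is worth a sanity check, because identical rows in the training and test sets can inflate test accuracy. In the `df.head(26)` preview above, rows 2 and 5 are identical in every displayed column. A minimal sketch of such a check follows, using small hypothetical frames in place of `X_train` and `X_test`; the frame contents and the `_demo` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test
train_demo = pd.DataFrame({"Age": [33, 35, 46], "Smoking": [3, 2, 7]})
test_demo = pd.DataFrame({"Age": [35, 52], "Smoking": [2, 3]})

# An inner merge on all shared columns keeps only the test rows that also
# occur in the training frame; de-duplicating train first means each test
# row is counted at most once
overlap_demo = test_demo.merge(train_demo.drop_duplicates(), how="inner")
n_overlap_demo = len(overlap_demo)
print(f"{n_overlap_demo} of {len(test_demo)} test rows also appear in the training set")
```

On the real data, `df.duplicated().sum()` would give a quick count of fully repeated rows before splitting.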

In [47]:
# Generate a bar plot showing the mean 'Air Pollution' and 'Alcohol use' scores for each Level
level_means = df.groupby('Level')[['Air Pollution', 'Alcohol use']].mean()
ax = level_means.plot(kind='bar', title="Lung Cancer", figsize=(15, 10), legend=True, fontsize=12, rot=0)
ax.set_xlabel("Level", fontsize=12)
ax.set_ylabel("Mean score", fontsize=12)
plt.tight_layout()
plt.show()



