## Data Mining HW3 
## Name: Xiaofeng Cao

In [1]:
import requests
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
def get_census_data():
    cols = ['age', 'workclass', 'fnlwgt', 'education', 
            'education_num', 'marital_status', 'occupation',
            'relationship', 'race', 'sex', 'capital_gain',
            'capital_loss', 'hours_per_week', 'native_country', 
            'over_fifty_k']
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
    with requests.get(url, stream=True) as r:
        results = [l.decode().split(',') for l in r.iter_lines()]
    return pd.DataFrame(results, columns=cols)

In [3]:
table = get_census_data()

In [4]:
table.shape

(32562, 15)
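The row count is one higher than expected, apparently because a blank line at the end of the file becomes an extra, mostly empty row (index 32561, dropped further down). Every field also keeps the space that follows each comma. As a possible alternative, and only a sketch rather than the loader used in this homework, `pd.read_csv` with `skipinitialspace=True` avoids both. The two sample rows are illustrative stand-ins for the downloaded file, and treating `?` as missing through `na_values` is an assumption about how unknown values should be handled.

```python
import io
import pandas as pd

cols = ['age', 'workclass', 'fnlwgt', 'education',
        'education_num', 'marital_status', 'occupation',
        'relationship', 'race', 'sex', 'capital_gain',
        'capital_loss', 'hours_per_week', 'native_country',
        'over_fifty_k']

# Illustrative stand-in for the downloaded file, including a trailing blank line.
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "\n"
)

demo = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=cols,
    skipinitialspace=True,  # drop the space after each comma
    na_values="?",          # assumption: treat '?' as a missing value
)

print(demo.shape)
print(demo.dtypes.head(3))
```

With this approach the numeric columns are parsed as numbers on load, so a separate `pd.to_numeric` pass would likely be unnecessary.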

In [5]:
table.over_fifty_k.value_counts(normalize = True)

 <=50K    0.75919
 >50K     0.24081
Name: over_fifty_k, dtype: float64

### Q1.  Three issues/strategies
One interesting task is to use 'over_fifty_k' as the class variable and predict, from the remaining variables, whether a person's income exceeds fifty thousand dollars. Several steps come before building a good machine learning model.

- First, we need to preprocess the data: handle missing and duplicate values and convert data types (numeric, datetime, boolean, etc.). This dataset has many categorical features, such as 'workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', and 'sex'. To encode them we could use ordinal encoding or one-hot encoding; I will use one-hot encoding in this homework. Improving data quality should also improve model performance.
- Second, understanding the story behind the data is critical to building a good model, so exploratory data analysis is essential. For numeric features such as age, capital_gain, capital_loss, and hours_per_week, we want to understand how they are distributed. For categorical features, we want the counts in each category.
- Third, after an overall analysis of the dataset, we sometimes need to select features before modeling. Here feature selection is probably not very helpful, since we have only 14 features. Class imbalance is an issue, however: approximately 76% of the examples earn 50K or less (see above).

## 1.0 Data preprocessing and one hot encoding

In [6]:
table.drop_duplicates(inplace=True)

In [7]:
table[table.isnull().any(axis=1)].index

Int64Index([32561], dtype='int64')

In [8]:
table.drop(index=32561,inplace=True)

In [9]:
table.reset_index(drop=True, inplace=True)

In [10]:
obj_cols = ["workclass","education","marital_status","occupation", "relationship","race","sex","native_country"]

In [11]:
one_hot_dfs = []
for col in obj_cols:
    one_hot_dfs.append(pd.get_dummies(table[col],prefix=f"is_{col}",drop_first=True))

In [12]:
one_hot = pd.concat(one_hot_dfs, axis=1)
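The loop and `pd.concat` above can probably be collapsed into a single `pd.get_dummies` call, which accepts a DataFrame together with a list of prefixes. The sketch below checks that equivalence on a small made-up frame; `toy` and its values are hypothetical, not taken from the census data.

```python
import pandas as pd

# Hypothetical toy data standing in for the categorical columns.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "sex": ["Male", "Female", "Female", "Male"],
})
toy_cols = ["workclass", "sex"]

# Per-column loop, as in the cells above.
looped = pd.concat(
    [pd.get_dummies(toy[c], prefix=f"is_{c}", drop_first=True) for c in toy_cols],
    axis=1,
)

# Single call: one prefix per encoded column.
one_call = pd.get_dummies(
    toy[toy_cols],
    prefix=[f"is_{c}" for c in toy_cols],
    drop_first=True,
)

print(list(one_call.columns))
print(one_call.equals(looped))
```

Note that `drop_first=True` removes the alphabetically first category of each column, which then acts as the implicit reference level.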

In [13]:
num_cols = [col for col in table.columns if col not in obj_cols][:-1]

In [14]:
for i in num_cols:
    table[i] =pd.to_numeric(table[i])

In [15]:
table.over_fifty_k = table.over_fifty_k.str.strip()

In [16]:
table.over_fifty_k = table.over_fifty_k == '>50K'


## 2.0 a decision tree classifier using the default parameters

In [18]:
y = table['over_fifty_k']

In [19]:
X = one_hot.join(table[num_cols])

In [20]:
X.shape

(32537, 100)

In [21]:
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size = 1/3, random_state = 1)

In [22]:
DT = DecisionTreeClassifier()
DT.fit(X_train,y_train)
pred_DT = DT.predict(X_test)

In [23]:
def get_confusion(y_true, y_pred):
    # Rows are the actual classes, columns are the predicted classes.
    matrix = confusion_matrix(y_true, y_pred)
    df = pd.DataFrame(matrix, index=["negative_actual", "positive_actual"],
                      columns=["negative_predicted", "positive_predicted"])
    return df

### 2.a. What is the accuracy of the classifier on the test data? 

In [24]:
accuracy_score(y_test, pred_DT)

0.8165222201733358

### 2.b How many leaves are there in the generated tree?

In [25]:
DT.tree_.node_count

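6437

`tree_.node_count` is the total number of nodes, internal and leaf. Because scikit-learn trees are binary, this corresponds to (6437 + 1) / 2 = 3219 leaves.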
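The leaf count can also be read directly with `get_n_leaves()`. The sketch below uses synthetic data from `make_classification` (an assumption made only to keep the example self-contained) and checks the relation nodes = 2 * leaves - 1 that holds for a binary tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

demo_nodes = demo_tree.tree_.node_count   # internal nodes + leaves
demo_leaves = demo_tree.get_n_leaves()    # leaves only

# Every internal node has exactly two children, so nodes = 2 * leaves - 1.
print(demo_nodes, demo_leaves, demo_nodes == 2 * demo_leaves - 1)
```

The same call would apply to the fitted trees in this notebook, for example `DT.get_n_leaves()`.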

In [26]:
feature_importance_df =pd.DataFrame(DT.feature_importances_,index=X.columns,columns=['GINI'])

In [27]:
feature_importance_df.sort_values(by='GINI',ascending=False)[:5]

| | GINI |
|---|---|
| is_marital_status_ Married-civ-spouse | 0.202556 |
| fnlwgt | 0.182205 |
| education_num | 0.118344 |
| age | 0.114872 |
| capital_gain | 0.102834 |


### 2.c. the confusion matrix

In [28]:
get_confusion(y_test,pred_DT)

| | negative_predicted | positive_predicted |
|---|---|---|
| negative_actual | 7191 | 1112 |
| positive_actual | 878 | 1665 |
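Accuracy alone hides how the errors split between the two classes. As a rough sketch, the counts from the matrix above can be turned into precision and recall for the positive (>50K) class; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
tn, fp = 7191, 1112   # actual negatives
fn, tp = 878, 1665    # actual positives

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way should agree with the `accuracy_score` value reported in 2.a.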


## 3.0 use 10-fold cross validation

In [29]:
scores_a = cross_val_score(DT, X=X,y=y,cv = 10, scoring='accuracy')

In [30]:
scores_a

array([0.80885065, 0.81499693, 0.81161647, 0.81192379, 0.82267978,
       0.81837738, 0.82851875, 0.81837738, 0.83553643, 0.80596556])

In [31]:
scores_a.mean()

0.8176843106191525

Cross-validation gives a mean accuracy of 0.8177, compared with 0.8165 on the single train/test split. This is not a significant difference.
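One way to judge whether the gap matters is to compare it with the fold-to-fold spread. The sketch below uses the ten scores printed above and only the standard library; treating one standard deviation as the yardstick is an informal rule of thumb, not a formal significance test.

```python
import statistics

# The ten fold accuracies printed above.
fold_scores = [0.80885065, 0.81499693, 0.81161647, 0.81192379, 0.82267978,
               0.81837738, 0.82851875, 0.81837738, 0.83553643, 0.80596556]
single_split = 0.8165222201733358   # test accuracy from section 2.a

cv_mean = statistics.mean(fold_scores)
cv_std = statistics.stdev(fold_scores)   # sample standard deviation

gap = abs(cv_mean - single_split)
print(f"mean={cv_mean:.4f} std={cv_std:.4f} gap={gap:.4f}")
print("gap within one std:", gap < cv_std)
```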

## 4.0  modify one of the default decision tree parameters
In this case, I will change `class_weight=None` to `class_weight='balanced'`.

In [32]:
DT2 = DecisionTreeClassifier(class_weight='balanced')
DT2.fit(X_train,y_train)
pred_DT2 = DT2.predict(X_test)
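With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`, so the rarer class counts for more. A small sketch of that formula on made-up labels follows; the 75/25 split is hypothetical, chosen to resemble the class balance here.

```python
from collections import Counter

# Hypothetical labels with roughly the class balance seen in this dataset.
labels = [False] * 75 + [True] * 25

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Formula documented by scikit-learn for class_weight='balanced'.
balanced_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

print(balanced_weights)
```

Under this weighting each class contributes the same total weight, which is why the minority class receives the larger per-sample weight.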

### 4.a. What is the accuracy of the classifier on the test data? 

In [33]:
accuracy_score(y_test, pred_DT2)

0.8196570164115803

Accuracy improved from 0.8165 to 0.8197. Adding balanced class weights to the classifier gives a small improvement (around 0.3 percentage points).

### 4.b How many leaves are there in the generated tree?

In [34]:
DT2.tree_.node_count

6641

As before, this is the total node count; it corresponds to (6641 + 1) / 2 = 3321 leaves.

### 4.c. the confusion matrix

In [35]:
get_confusion(y_test,pred_DT2)

| | negative_predicted | positive_predicted |
|---|---|---|
| negative_actual | 7263 | 1040 |
| positive_actual | 916 | 1627 |


The confusion matrix changes only slightly: false positives drop from 1112 to 1040, while false negatives rise from 878 to 916.

## 5. Random Forest algorithm 

`n_estimators` is the number of trees to be used in the forest. In this case, I use 20.

In [36]:
RFC = RandomForestClassifier(n_estimators=20,class_weight='balanced_subsample')
RFC = RFC.fit(X_train, y_train)
pred_RFC = RFC.predict(X_test)

### 5.a. What is the accuracy of the classifier on the test data? 

In [39]:
accuracy_score(y_test, pred_RFC)

0.8557071731513922

### 5.b the confusion matrix

In [40]:
get_confusion(y_test,pred_RFC)

| | negative_predicted | positive_predicted |
|---|---|---|
| negative_actual | 7678 | 625 |
| positive_actual | 940 | 1603 |


Random Forest improves the accuracy from about 82% to about 86%, a substantial gain. It predicts more true negatives (7263 to 7678) by cutting false positives (1040 to 625); false negatives rise slightly (916 to 940).

## 6. Change the class variable to `sex`

In [41]:
table = pd.concat([table, pd.get_dummies(table['sex'], prefix="is_sex", drop_first=True)], axis=1)

In [42]:
table.drop(columns=['sex'],inplace=True)

In [43]:
obj_cols2 = ["workclass","education","marital_status","occupation", "relationship","race","native_country"]
one_hot_dfs2 = []
for col in obj_cols2:
    one_hot_dfs2.append(pd.get_dummies(table[col],prefix=f"is_{col}",drop_first=True))
one_hot2 = pd.concat(one_hot_dfs2, axis=1)

In [44]:
X2 = pd.concat([one_hot2,table[num_cols],table['over_fifty_k']],axis=1)

In [45]:
y2 = table['is_sex_ Male']

In [46]:
X_train2, X_test2, y_train2,y_test2 = train_test_split(X2,y2,test_size = 1/3, random_state = 1)

In [47]:
DT3 = DecisionTreeClassifier()
DT3.fit(X_train2,y_train2)
pred_DT3 = DT3.predict(X_test2)

### 6.a. What is the accuracy of the classifier on the test data? 

In [48]:
accuracy_score(y_test2, pred_DT3)

0.8116356260372487

### 6.b How many leaves are there in the generated tree?

In [49]:
n_nodes = DT3.tree_.node_count

In [50]:
n_nodes

6045

This is the total node count; it corresponds to (6045 + 1) / 2 = 3023 leaves.

### 6.c. the confusion matrix

In [51]:
con_mtx =get_confusion(y_test2,pred_DT3)
con_mtx

| | negative_predicted | positive_predicted |
|---|---|---|
| negative_actual | 2520 | 1015 |
| positive_actual | 1028 | 6283 |


### 6.d Based on the results in the confusion matrix, specify the number of females and males in the test set as counts (whole numbers) and as percentages.  **Note**: 1: Male, 0: Female

In [52]:
y_test2.value_counts()

1    7311
0    3535
Name: is_sex_ Male, dtype: int64

In [53]:
y_test2.value_counts(normalize=True)

1    0.674073
0    0.325927
Name: is_sex_ Male, dtype: float64

### 6.e If you had to build a simple classifier to always guess the most common sex, what would the accuracy of the classifier be?

In [54]:
con_mtx.iloc[1].sum()/(con_mtx.iloc[0].sum()+con_mtx.iloc[1].sum())

0.6740733911119307

### 6.f Carefully examine the induced decision tree, and the attributes involved, and state why the high accuracy results are almost completely meaningless. What is the problem? 

In [55]:
feature_importance_df =pd.DataFrame(DT3.feature_importances_,index=X2.columns,columns=['GINI'])

In [56]:
feature_importance_df.sort_values(by='GINI',ascending=False)[:5]

| | GINI |
|---|---|
| is_relationship_ Wife | 0.192129 |
| is_marital_status_ Married-civ-spouse | 0.18817 |
| fnlwgt | 0.146973 |
| age | 0.086627 |
| hours_per_week | 0.053214 |


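There are 6045 nodes in this tree, so it is very likely overfitted and should be pruned. The most important feature, is_relationship_ Wife, also points to a problem with the attributes: the relationship value Wife is essentially a restatement of sex, so the tree appears to be reading the label from a proxy feature rather than learning anything meaningful. In addition, we should look at ROC, precision, and recall scores to see whether the model is predicting positives carefully.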
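A minimal sketch of the pruning idea, assuming cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`. The synthetic data and the value 0.005 are arbitrary choices for illustration, not tuned for the census data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, used only so the example runs on its own.
X_syn, y_syn = make_classification(n_samples=2000, n_features=20,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=1/3,
                                          random_state=1)

# Same random_state so both models grow the same tree before pruning.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.005).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, model.tree_.node_count, round(model.score(X_te, y_te), 4))
```

In practice `ccp_alpha` (or a simpler limit such as `max_depth`) would be chosen by cross-validation rather than fixed by hand.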