# Problem Set 4 - Dealing with noisy data

_Data Preparation Course at UCU, 2019_

### NB

__1) Which programming languages to use?__

We recommend to use Python for this task, but if you find working library alternatives for the algorithms we
use in this assignment in R, you are free to work with that as well.

__2) What libraries/packages to use?__

You are free to choose any appropriate libraries (good choice would be __pandas__, __numpy__,
__scicit-learn__).

__3) How to summarize my homework?__

The best way is to create an Jupyter/R notebook with code and explanations for each strategy. In case you
are not familiar with these tools, you can create a Python/R scripts and write explanations as comments.
However, we strongly recommend you to use Jupyter/R notebooks, as those are #1 tools in applied data
analysis nowadays.

__4) Useful links__

1. [Deaing with Noisy Data in Data Science.](https://medium.com/analytics-vidhya/dealing-with-noisy-data-in-data-science-e177a4e32621)
2. [Decision trees in Scikit-learn.](https://scikit-learn.org/stable/modules/tree.html)

## Tasks

In this homework you will investigate the impact of different types of noise on the accuracy of classification
model based on __<font color="black">[(Census Income dataset)](https://archive.ics.uci.edu/ml/datasets/Census+Income)</font>__. Noise is an unavoidable problem which affects all stages of Data Mining process, so it is extremely important to learn how to deal with the noise in the most appropriate way. 

### __1) Logistic regression.__

Similar to previous assignment, you’ll have to train multiple logistic regression models. We encourage you to use provided jupyter notebook with working template of logistic regression for Census dataset. Please remember that LR is not the main topic of this assignment, so do not bother yourself tuning your models. The purpose of this assignment is to investigate the negative impact of noise in your dataset on the accuracy of classification and learn basic methods of dealing with problems of this type.

Regarding missing values in the dataset - you need to impute them using __global most common substitution strategy__ from the previous assignment.

__Treat dataset you obtain after missing values imputation as an original one. All further
modification in this homework perform on this dataset, not on the one you have before missing
values imputation.__

__1.1.__ Train original logistic regression model provided in jupyter notebook. Save values of train and test
classification accuracy scores for future comparison.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.linear_model import LogisticRegression

In [9]:
column_names = ["age","workclass", "fnlwgt", "education", "education-num", "marital-status",
                   "occupation", "relationship","race","sex","capital-gain","capital-loss",
                   "hours-per-week","native-country","profit"]
                
df = pd.read_csv("adult.data", header=None, names=column_names, index_col=False)
test_df = pd.read_csv("adults.test", header=None, names=column_names, index_col=False, skiprows=1)
indexes=["attribute noise","target noise", "no noise"]
score_table = pd.DataFrame(columns=["1%","5%","10%","20%","30%","40%","50%"],index=indexes)

In [58]:
df.shape

(32561, 15)

In [10]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,profit
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [11]:
def prepare_data(dff, train=False):
    df=dff.copy()
    df[(df["workclass"] == " ?") | (df["occupation"] == " ?") | (df["native-country"] == " ?")] = df[(df["workclass"] == " ?") | (df["occupation"] == " ?") | (df["native-country"] == " ?") ].replace(" ?",np.nan)
    df.loc[df["capital-gain"] == 0, "capital-gain"] = df["capital-gain"].mean()
    df.loc[df["capital-loss"] == 0, "capital-loss"] = df["capital-loss"].mean()
    df.fillna(df.mean(),inplace=True)
    df.drop(["education","marital-status","fnlwgt"],axis=1,inplace=True)  
    if train:
        tmp = {" <=50K":0, " >50K":1}
    else:
        tmp = {" <=50K.":0, " >50K.":1}
    df["profit"] = df["profit"].map(lambda x: tmp[x])
    df.loc[df["native-country"].isin([' United-States', ' Canada', ' South']), "native-country"] = "North America"
    df.loc[df["native-country"].isin([' Mexico',' Dominican-Republic',' Guatemala',' Trinadad&Tobago', ' Columbia',' El-Salvador', ' Ecuador', ' Philippines',' Honduras',' Outlying-US(Guam-USVI-etc)', ' Haiti', ' Puerto-Rico', " Nicaragua", 'Nicaragua', ' Peru', ' Cuba',' Jamaica']), "native-country"] = "Latin America"
    df.loc[df["native-country"].isin([' Taiwan', ' Hong', ' Thailand',' Japan', ' Cambodia', ' India',' Iran',' Vietnam',' China', ' Laos']), "native-country"] = "Asia"
    df.loc[df["native-country"].isin([' Italy',' Greece',' Holand-Netherlands', ' Poland',' Germany',' England',' Ireland',' Hungary',' France', ' Yugoslavia',' Scotland', ' France',' Portugal', " Netherlands"]), "native-country"] = "Europe"
    df = pd.get_dummies(df)
    pr = df["profit"]
    df.drop(["profit"],axis=1,inplace=True)
    return df,pr

In [12]:
X, y = prepare_data(df,True)
X_tst, y_tst = prepare_data(test_df)

In [13]:
lr = LogisticRegression(solver="liblinear")
lr.fit(X,y)
print(lr.score(X_tst, y_tst))
score_table.loc["no noise",:] = lr.score(X_tst, y_tst) 

0.8522203795835637


### __2) [1pt] Misclassification noise.__

__2.1.__ Introduce misclassification in your dataset. Randomly flip $n\%$ of the target variable (‘y’) values. Try $n = (1, 5, 10, 20)$. Perform this process __only in train dataset__. Leave test dataset unchanged.

In [17]:
percents = [1,5,10,20,30,40,50]

In [18]:
def addNoise(df_col, perc, chng=None): # if none than negate
    indexes = df_col.sample(int(df_col.shape[0]*0.01*perc)).index
    if chng is None:
        chng = {val:0 for val in df_col.iloc[indexes]}
    return pd.concat([df_col.iloc[indexes].map(lambda x: chng[x]), df_col.iloc[~df_col.index.isin(indexes)]]).sort_index()

In [19]:
y_ar = list()
for n in [1,5,10,20,30,40,50]:
    y_ar.append(addNoise(y, n,{0:1,1:0}))

__2.2.__ For each $n$ train separate model. Record train and test accuracy for each of these models.

In [20]:
for inx,y_ in enumerate(y_ar):
    lr = LogisticRegression(solver="liblinear")
    lr.fit(X,y_)
    
    display(lr.score(X,y_), lr.score(X_tst,y_tst))
    score_table.loc["target noise",str(percents[inx])+"%"] = lr.score(X_tst,y_tst)    
    print("\n")

0.8463192162402875

0.8531416989128432





0.8165596879702712

0.8517904305632332





0.7785080310801266

0.8484736809778269





0.7038481619114892

0.8441741907745225





0.6372347286631246

0.8425158159818193





0.5721261632013759

0.8328112523800749





0.5112557968121372

0.4895890915791413





__2.3.__ What is the highest safe fraction (approximately) of misclassified examples? (by ‘safe’, we mean fraction
of misclassified examples with which difference of accuracies between original and misclassified model does
not exceed 0.01)

It seems to be that 40% of wrong y labels is the threshold, when test drops more than 0.01

### __3) [1pt] Attribute noise.__

__3.0.__ For $n = (1,5,10,20)$ create datasets with different levels of attribute noise.

__3.1.__ Introduce attribute noise to the __age__ column. Randomly negate $n\%$ of the values of this attribute.

In [21]:
datasets = [df.copy() for _ in range(7)]

In [22]:
for inx,ds in enumerate(datasets):
    ds["age"] = addNoise(ds["age"],percents[inx])

__3.2.__ Introduce attribute noise to the __education_num__ column. Randomly replace $n\%$ of the values of this attribute with random large numbers in range $[20,100]$.

In [23]:
for inx, ds in enumerate(datasets):
    ds["education-num"] = addNoise(ds["education-num"],percents[inx], {num:np.random.randint(20,100) for num in range(20)})

__3.3.__ Introduce attirute noise to the __race__ column. Randomly replace $n\%$ of the values of this attribute with
any other random race from the set of existing races.

In [24]:
chng_r = {race:np.random.choice([other_race for other_race in df["race"].unique() if other_race != race]) for race in df["race"].unique()}
for inx, ds in enumerate(datasets):
    ds["race"] = addNoise(ds["race"],percents[inx], chng_r)

__3.4.__ For each $n$ train separate model. Record train and test accuracy for each of these models.

In [25]:
for inx,ds in enumerate(datasets):
    X,y = prepare_data(ds,train=True)
    
    lr = LogisticRegression(solver="liblinear")
    lr.fit(X,y)
    
    print(lr.score(X,y), lr.score(X_tst,y_tst))
    score_table.loc["attribute noise",str(percents[inx])+"%"] = lr.score(X_tst,y_tst)

0.8439544240041768 0.8460782507217002
0.8421117287552594 0.8452183526810393
0.8435244617794294 0.8452183526810393
0.8444150978164061 0.8461396720103188
0.8431559227296459 0.8455254591241325
0.8439851355916588 0.8455868804127511
0.843831577654249 0.8458325655672256


__3.5.__ Quantify the degradation of the model after introducing each new level of noise to its attributes.

It doesn't(almost). I think it happens because of small feature correlation.

### __4) [1pt] Impact comparison.__

__4.1.__ Build a table to compare accuracy of the model on the original dataset with models based on datasets
with different types and levels of noise introduced.

In [26]:
score_table

Unnamed: 0,1%,5%,10%,20%,30%,40%,50%
attribute noise,0.846078,0.845218,0.845218,0.84614,0.845525,0.845587,0.845833
target noise,0.853142,0.85179,0.848474,0.844174,0.842516,0.832811,0.489589
no noise,0.85222,0.85222,0.85222,0.85222,0.85222,0.85222,0.85222


__4.2.__ What has greater impact on the accuracy of the model: class or attribute noise? How would you explain
it? (4-5 sentences).

Class influences model accuracy more, because features can either be correlated with target or not(have a big impact or small), but target obviously correlated with itself so changing parameters should ruin the pattern, that model need to learn.

__4.3.__ What kind of noise would you address first? Why? (2-3 sentences)

I'd first try to take care of class noise, if it is possible cause spoils the results more.

### __5) [2pt] Misclassification noise elimination.__

__5.1.__ Use training dataset with 10% of misclassified instances which you obtained in __Task 2__.

In [35]:
train_y = y_ar[2]
train_X = X

In [39]:
train_y.index

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            32551, 32552, 32553, 32554, 32555, 32556, 32557, 32558, 32559,
            32560],
           dtype='int64', length=32561)

__5.2.__ Apply Cross-Validated Committees Filter algorithm to identify and fix mislabled instances in this dataset.

• You can read the full description of this algorithm in “Data Preprocessing In Data Mining” by S. Garcia,
J. Luengo and F. Herrera [page 117, Section 5.3.2].

• Use scikit-learn utilities to create and train Decision Tree classifiers for this algorithm. You can read
more about them in [__(Decision Trees)__](https://scikit-learn.org/stable/modules/tree.html).

• Use $\Gamma = 5$.

In [61]:
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
kf = KFold(n_splits=5,shuffle=False)

In [43]:
models = []
for train,test in kf.split(train_X):
    tr_X, ts_X = train_X.iloc[train,:], train_X.iloc[test,:]
    tr_y, ts_y = train_y[train], train_y[test]
    
    dt = DecisionTreeClassifier()
    dt.fit(tr_X,tr_y)
    
    models.append(dt)

In [57]:
votes = {inx:0 for inx in train_y.index}
for model in models:
#     pred = pd.Series(model.predict(train_X), index=train_y.index)
    diff = pd.concat([pred,train_y]).drop_duplicates(keep=False)
    
    for i in train_y.index:
        if train_y[i] != pred[i]:
            votes[i] += 1

diff = {inx:votes[inx] for inx in votes.keys() if votes[inx] > 2}
print(len(diff))

2953


__5.3.__ What percent of mislabled records you fixed using this method? Is it possible to do better?

I did manages to find 2953 of wrong data out of 3256 possible, though I'm not sure all of them were classified correctly.

__5.4.__ Compare the accuracy of the classifier after elimination of mislabeled instances with its accuracy before
this procedure was performed.

In [59]:
train_X.drop([i for i in diff.keys()], inplace=True)
train_y.drop([i for i in diff.keys()], inplace=True)

In [62]:
tr_X, ts_X = train_test_split(train_X, test_size=0.2,shuffle=False)
tr_y, ts_y = train_test_split(train_y, test_size=0.2,shuffle=False)

lg = LogisticRegression(solver="liblinear")
lg.fit(tr_X, tr_y)
print(lg.score(ts_X, ts_y))

0.7904424181019926


It seems like it is even worse now.