## Glass Classification--KNN

This note follows the CRISP-DM (cross-industry standard process for data mining) process. Code- and algorithm-related statements are written in English; some of the other sections were originally drafted in Chinese for better elaboration.<br>
#### CRISP-DM Process:<br>

 - Business understanding
 - Data understanding
 - Data preparation
 - Modeling
 - Evaluation
 - Deployment

### Data set source
https://archive.ics.uci.edu/ml/datasets/Glass+Identification <br>
No missing values, as per the source description.

### Business understanding

The data set has 9 properties: the first is an optical property (the refractive index, `RI`) and the remaining 8 are the chemical oxide contents of the glass, so property 1 and properties 2-9 are measured on different scales.<br>
Unit of measurement for the contents: weight percent in the corresponding oxide.<br>
The glass class has 7 types, but type 4 does not appear in this data set.<br>

----------
The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence...if it is correctly identified!

---

Glass types:<br>
1. building_windows_float_processed: building windows, float glass<br>
2. building_windows_non_float_processed: building windows, non-float glass<br>
3. vehicle_windows_float_processed: vehicle windows, float glass<br>
4. vehicle_windows_non_float_processed (none in this database): vehicle windows, non-float glass<br>
5. containers<br>
6. tableware<br>
7. headlamps: vehicle headlamps<br>


----------
Main ingredient of glass: SiO2. The other contents are also oxides, such as Na2O, CaO, and K2O.

### Data understanding

In [None]:
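import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; use whichever is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')

# list the files in the input directory (Kaggle kernel layout)
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
path='../input/glass.csv'

# load the data set and preview the first rows
df=pd.read_csv(path)
df.head()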

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
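# count the samples of each glass type (avoid shadowing the built-in `type`)
type_counts=df['Type'].groupby(df['Type']).count()
type_counts

### Here we can see the classes are imbalanced: classes 1 and 2 are in the majority

In [None]:
# bar chart of the class counts
type_counts.plot(kind='bar')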

### Data preparation--Do not transform intentionally<br>
For comparison, we first make no changes to the initial data set.<br>
Note: there are no missing values, so data cleaning can be skipped.

With the code below, the accuracy on the test set is approx. 72%. The best K value is 1, which means each sample is classified by its single nearest neighbor.<br>
However, we get the warning "The least populated class in y has only 8 members, which is too few. The minimum number of groups for any class cannot be less than n_splits=10." This is caused by the small sample size and the class imbalance of the data set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# get column titles except the last column
features=df.columns[:-1].tolist()

# get data set features
X=df[features].values
# get labels
y=df['Type'].values

# split data to train data set and test data set
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
# store scores of KNN model by K=1 to 31
scores=[]

# loop k from 1 to 31, and get cross validation score of each K value
for k in range(1,32):
    knn=KNeighborsClassifier(k)
    score_val=cross_val_score(knn,X_train,y_train,scoring='accuracy',cv=10)
    score_mean=score_val.mean()
    scores.append(score_mean)

# index of the maximum mean score; +1 because K starts at 1
best_k=np.argmax(scores)+1
print(best_k)
# build the KNN model with the best K
knn=KNeighborsClassifier(best_k)
# fit on the training set
knn.fit(X_train,y_train)
# accuracy on the test set (knn.score returns the mean accuracy)
print("prediction accuracy:",knn.score(X_test,y_test))
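The warning quoted above appears because 10-fold stratified cross-validation needs at least 10 samples per class. One possible way to avoid it, sketched here on synthetic data rather than on `glass.csv`, is to cap the number of folds at the size of the smallest class. The names `X_demo`, `y_demo`, and `n_splits` are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an imbalanced training set (hypothetical data, not glass.csv).
rng = np.random.default_rng(0)
y_demo = np.array([1] * 60 + [2] * 60 + [6] * 8)
X_demo = rng.normal(size=(len(y_demo), 9)) + y_demo[:, None]

# The size of the least populated class limits how many stratified folds make sense.
_, class_counts = np.unique(y_demo, return_counts=True)
n_splits = int(min(10, class_counts.min()))

cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)
fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X_demo, y_demo,
                              scoring='accuracy', cv=cv)
print("folds used:", n_splits)
print("mean CV accuracy:", fold_scores.mean())
```

With fewer folds each validation fold is larger, so every class can be represented in every fold; the trade-off is that fewer scores are averaged per K value.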

### Now we try to solve the problems below:<br>
- low data set volume
- skewness

### Get balanced sample by oversampling

In [None]:
df3=df[df['Type']==3]

In [None]:
df3=pd.concat([df3]*4)

In [None]:
df5=df[df['Type']==5]

In [None]:
df5=pd.concat([df5]*5)

In [None]:
df6=df[df['Type']==6]

In [None]:
df6=pd.concat([df6]*7)

In [None]:
df7=df[df['Type']==7]

In [None]:
df7=pd.concat([df7]*2)

In [None]:
df1=df[df['Type']==1]

In [None]:
df2=df[df['Type']==2]

In [None]:
df_balanced=pd.concat([df1,df2,df3,df5,df6,df7])

In [None]:
df_balanced.shape

In [None]:
df.head()

In [None]:
# class counts after oversampling (avoid shadowing the built-in `type`)
type_counts=df_balanced['Type'].groupby(df_balanced['Type']).count()
type_counts

In [None]:
type_counts.plot(kind='bar')
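The cell-by-cell oversampling above can also be written as a single loop over the classes. The sketch below uses a tiny made-up frame (`df_demo`) and illustrative repeat factors; for the notebook's data the factors used above would be `{3: 4, 5: 5, 6: 7, 7: 2}`, with classes 1 and 2 kept once.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df; only the 'Type' column matters here.
df_demo = pd.DataFrame({
    'RI':   [1.52, 1.51, 1.53, 1.52, 1.51, 1.52, 1.50],
    'Type': [1, 1, 1, 1, 2, 2, 3],
})

# Repeat factor per class; classes not listed are kept once.
# These factors are illustrative, chosen for the toy data above.
repeats = {2: 2, 3: 4}

parts = [
    pd.concat([group] * repeats.get(label, 1))
    for label, group in df_demo.groupby('Type')
]
df_demo_balanced = pd.concat(parts, ignore_index=True)
print(df_demo_balanced['Type'].value_counts().sort_index())
```

Keeping the factors in one dictionary makes it easier to see and adjust how strongly each minority class is replicated.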

### Now we try to model again

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# feature columns: every column except the last one ('Type')
features=df_balanced.columns[:-1].tolist()
X=df_balanced[features].values
y=df_balanced['Type']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
# start from an empty list so scores from the previous run are not reused
scores=[]
# loop K from 1 to 31 and store the mean cross-validation score of each K
for k in range(1,32):
    knn=KNeighborsClassifier(k)
    score_val=cross_val_score(knn,X_train,y_train,scoring='accuracy',cv=10)
    score_mean=score_val.mean()
    scores.append(score_mean)
# index of the maximum mean score; +1 because K starts at 1
best_K=np.argmax(scores)+1
print('best K is:',best_K)
knn=KNeighborsClassifier(best_K)
knn.fit(X_train,y_train)
print("prediction accuracy:",knn.score(X_test,y_test))


Now you can see that no warning occurs, and the accuracy grows to about 90%.

### Further data exploring
Data mining is usually an iterative process of trial and improvement: we may go back and forth between data exploration and modeling, so now we will try to expose more details of this data set.

We use a box plot to show the value range of each feature.<br>
One fact stands out: silicon oxide is the main content of glass, and its weight percent is much higher than that of the other contents.<br>
For this kind of data, we usually normalize the features to the same scale for **possibly** better model results.<br>
Note: normalization may improve classifier performance, but this is not guaranteed.
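To make "normalize to the same scale" concrete: min-max scaling maps each feature to the range [0, 1] with (x - min) / (max - min), computed per column. The sketch below uses made-up values (not taken from the glass data) and checks the manual formula against scikit-learn's `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up features on very different scales (values are illustrative only).
X_demo = np.array([[72.0, 0.1],
                   [73.0, 0.5],
                   [70.0, 0.3]])

# Manual min-max scaling, column by column: (x - min) / (max - min).
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
X_sklearn = MinMaxScaler().fit_transform(X_demo)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Because KNN relies on distances, a feature with a large numeric range can dominate the distance unless the features are brought to a comparable scale.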

### Model again, but normalize feature values before training

In [None]:
df_balanced.iloc[:,:-1].boxplot()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

# feature columns: every column except the last one ('Type')
features=df_balanced.columns[:-1].tolist()
X=df_balanced[features].values
y=df_balanced['Type']

# normalization: scale every feature to the range [0, 1]
min_max_scaler=preprocessing.MinMaxScaler()
X_minmax=min_max_scaler.fit_transform(X)

X_train,X_test,y_train,y_test=train_test_split(X_minmax,y,test_size=0.2,random_state=1)
# start from an empty list so scores from the previous run are not reused
scores=[]
# loop K from 1 to 31 and store the mean cross-validation score of each K
for k in range(1,32):
    knn=KNeighborsClassifier(k)
    score_val=cross_val_score(knn,X_train,y_train,scoring='accuracy',cv=10)
    score_mean=score_val.mean()
    scores.append(score_mean)
# index of the maximum mean score; +1 because K starts at 1
best_K=np.argmax(scores)+1
print('best K is:',best_K)
knn=KNeighborsClassifier(best_K)
knn.fit(X_train,y_train)
print("prediction accuracy:",knn.score(X_test,y_test))

We find that normalization brings no performance improvement on this data set.

In [None]:
X

In [None]:
X_minmax

### Dive into the data again--PCA-- Dimensionality Reduction
A prerequisite for PCA to be useful is high correlation among the features, so we need to compute the correlation matrix before trying PCA.

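### Can PCA (principal component analysis) be applied when the sample size is small?<br>
The purpose of PCA is to reduce high-dimensional data to fewer dimensions while explaining as much of the sample variance as possible. Usually, given enough samples, we look for a small number of principal components whose cumulative contribution rate reaches 85% or more. With a small sample, however, the cumulative contribution of 2-3 principal components is often very large; put simply, it is easy to find features that cover those few samples. As samples are added, diversity increases and the cumulative contribution rate tends to drop.
In addition, a small sample may not accurately reflect the population: when the sample size is too small, it is easy to obtain a set of data points that "accidentally" lie close to the same plane. For example, projecting a small sample onto some two-dimensional plane may retain most of the variance, but after collecting much more data we may find that many points do not lie near that plane at all and are in fact far from it. The earlier result was only a coincidence caused by too little data.<br>
A basic principle: as long as computing capacity is not exceeded, a larger sample is always better when estimating parameters; this is supported by the central limit theorem.
The above concerns a small sample size. Another angle is low dimensionality: if the data set does not have many features and the correlation among them is also not high, PCA can be skipped.<br>
So can PCA be used on this data set? The answer is that we can try it and look at the result; if the result gets worse, we simply drop it. We should not dismiss PCA out of hand, because this data set has two kinds of attributes, refractive index and composition, and the composition is further divided into silicon oxide and other oxides, silicon being the main component of glass.<br>
[Does PCA place any requirement on the sample size?][1]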
  [1]: https://www.zhihu.com/question/20998460
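
The 85% cumulative contribution criterion mentioned above can be checked with `explained_variance_ratio_` from scikit-learn's `PCA`. The following is a minimal sketch on synthetic data (9 features generated from 3 latent factors, an assumption made for illustration only); `threshold` and `n_needed` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 9 features driven by 3 latent factors plus small noise
# (a stand-in for illustration, not the glass data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 9))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 9))

pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.85
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3))
print("components needed:", n_needed)
```

Running the same check on subsets of different sizes is one way to see whether a high cumulative ratio is stable or only an artifact of having few samples.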

In [None]:
df_balanced.head()

In [None]:
corr=df_balanced.iloc[:,:-1].corr()
corr

The correlations are fairly high but not extreme; only RI and Ca reach 0.78.<br>
We will still apply PCA for learning purposes.
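Reading the strongest pairs off a full correlation matrix by eye gets tedious as the number of features grows. One possible way to list them, shown on a synthetic frame with placeholder column names (`a`, `b`, `c`), is to keep the upper triangle of the matrix and sort by absolute value.

```python
import numpy as np
import pandas as pd

# Synthetic frame with one deliberately correlated pair ('a' and 'b');
# the column names are placeholders, not the glass features.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.3 * rng.normal(size=300),
    'c': rng.normal(size=300),
})

corr_demo = demo.corr()

# Keep the strict upper triangle so each pair appears once, then rank by |r|.
mask = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)
pairs = corr_demo.where(mask).stack().dropna()
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs.head())
```

Applied to the notebook's `corr`, the same steps would put the RI and Ca pair at the top of the list if it is indeed the strongest correlation.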

In [None]:
from pandas.plotting import scatter_matrix
sm=scatter_matrix(df_balanced.iloc[:,:-1], alpha=1, figsize=(10, 10), diagonal='kde')



After testing, PCA does not always improve the accuracy of the model. Here I got an accuracy of 93.75%, but please note that this is not a stable result: if you change the "test_size" or "random_state" values, performance drops.
So this model carries some risk of overfitting.
Also note that the purpose of PCA or other dimensionality reduction is not to improve performance: it is to reduce the amount of computation and to extract the essential components of the sample. It may or may not improve performance.
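The sensitivity to "random_state" described above can be measured rather than observed case by case. The sketch below repeats the split, fit, and score cycle over several seeds on a synthetic 6-class problem (an assumption for illustration, not the glass data) and reports the spread of the test accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 6-class problem with 9 features (a stand-in, not the glass data).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, n_informative=6,
                                     n_redundant=0, n_classes=6,
                                     n_clusters_per_class=1, random_state=0)

# Repeat the split/fit/score cycle with different seeds and look at the spread.
accuracies = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo)
    model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

accuracies = np.array(accuracies)
print("mean:", accuracies.mean(), "std:", accuracies.std())
print("min:", accuracies.min(), "max:", accuracies.max())
```

A large gap between the minimum and maximum suggests that a single reported score, such as one obtained with `random_state=1`, should be read with caution.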

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn import decomposition

# feature columns: every column except the last one ('Type')
features=df_balanced.columns[:-1].tolist()
X=df_balanced[features].values
y=df_balanced['Type']

# PCA: project the features onto 7 principal components
pca=decomposition.PCA(n_components=7)
pca.fit(X)
X=pca.transform(X)
print("Principal components:",X)

# normalization: scale every component to the range [0, 1]
min_max_scaler=preprocessing.MinMaxScaler()
X_minmax=min_max_scaler.fit_transform(X)

X_train,X_test,y_train,y_test=train_test_split(X_minmax,y,test_size=0.2,random_state=1)
# start from an empty list so scores from the previous run are not reused
scores=[]
# loop K from 1 to 31 and store the mean cross-validation score of each K
for k in range(1,32):
    knn=KNeighborsClassifier(k)
    score_val=cross_val_score(knn,X_train,y_train,scoring='accuracy',cv=10)
    score_mean=score_val.mean()
    scores.append(score_mean)
# index of the maximum mean score; +1 because K starts at 1
best_K=np.argmax(scores)+1
print('best K is:',best_K)
knn=KNeighborsClassifier(best_K)
knn.fit(X_train,y_train)
print("prediction accuracy:",knn.score(X_test,y_test))
# compare predicted labels with the true labels of the test set
result=knn.predict(X_test)
print(result)
myarray = np.asarray(y_test.tolist())
print(myarray)

### What am I doing here?
Because we used oversampling to get a balanced data set, the test set must contain duplicated samples, which affects the model's scores. So I remove the duplicated samples and compute the score again.

In [None]:
s=pd.DataFrame(X_test)
t=pd.DataFrame(y_test)
s.head()

In [None]:
t=t.reset_index()

In [None]:
t.head()

In [None]:
del t['index']

In [None]:
t.head()

In [None]:
X_test_u=pd.concat([s,t],axis=1)
X_test_u=X_test_u.drop_duplicates()
X_test_u.shape

In [None]:
X_test=X_test_u.iloc[:,:-1].values
y_test=X_test_u['Type']

In [None]:
print("prediction accuracy:",knn.score(X_test,y_test))
result=knn.predict(X_test)
print(result)
myarray = np.asarray(y_test.tolist())
print(myarray)
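Removing duplicates from the test set still leaves copies of some test rows in the training set, since the oversampling happened before the split. One alternative, which is not what the notebook above does, is to split the original rows first and oversample only the training part. The sketch below uses a made-up two-class frame (`demo`) with an illustrative repeat factor.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced frame; 'x' doubles as a unique row identifier.
demo = pd.DataFrame({'x': np.arange(60, dtype=float),
                     'Type': [1] * 40 + [2] * 20})

# 1) Split the original, duplicate-free rows first.
train, test = train_test_split(demo, test_size=0.2, random_state=1,
                               stratify=demo['Type'])

# 2) Oversample only the training part (class 2 appears twice, an illustrative factor).
train_balanced = pd.concat([train, train[train['Type'] == 2]], ignore_index=True)

# Check that no test row has a copy in the training data.
overlap = set(train_balanced['x']) & set(test['x'])
print(train_balanced['Type'].value_counts().sort_index())
print("rows shared between train and test:", len(overlap))
```

With this ordering the test score is computed only on rows the model has never seen in any copy, so it may come out lower than the scores reported above.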