## About this notebook

In this notebook, I quickly explore the `biorxiv` subset of the papers. Since it is stored in JSON format, the structure is likely too complex to directly perform analysis. Thus, I not only explore the structure of those files, but I also provide the following helper functions for you to easily format inner dictionaries from each file:
* `format_name(author)`
* `format_affiliation(affiliation)`
* `format_authors(authors, with_affiliation=False)`
* `format_body(body_text)`
* `format_bib(bibs)`

Feel free to reuse those functions for your own purpose! If you do, please leave a link to this notebook.

Throughout the EDA, I show you how to use each of those files. At the end, I show you how to generate a clean version of the `biorxiv` as well as all the other datasets, which you can directly use by choosing this notebook as a data source ("File" -> "Add or upload data" -> "Kernel Output File" tab -> search the name of this notebook).

### Update Log

* V9: First release.
* V10: Updated paths to include the [14k new papers](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137474).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('../input/cord-19-eda-parse-json-and-generate-clean-csv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Biorxiv: Exploration

Let's first take a quick glance at the `biorxiv` subset of the data. We will also use this opportunity to load all of the json files into a list of **nested** dictionaries (each `dict` is an article).
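
As a rough illustration of loading a folder of JSON files into a list of nested dictionaries, the sketch below walks a directory and parses each file. The helper name `load_json_dir` and the folder layout are assumptions for illustration, not the code used to produce the clean CSV files.

```python
import json
from pathlib import Path


def load_json_dir(folder):
    """Read every *.json file under `folder` into a list of dicts (hypothetical helper)."""
    articles = []
    for path in sorted(Path(folder).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            articles.append(json.load(f))  # each article becomes one nested dict
    return articles
```

Each element of the returned list can then be inspected with ordinary dictionary access, for example `articles[0].keys()`.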

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns  # visualisation
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# `normalize` was removed from LinearRegression in scikit-learn 1.2; features are standardised later with StandardScaler
lr = LinearRegression()
from sklearn.metrics import accuracy_score

In [None]:
df=pd.read_csv('../input/cord-19-eda-parse-json-and-generate-clean-csv/biorxiv_clean.csv')

In [None]:
df1=pd.read_csv('../input/cord-19-eda-parse-json-and-generate-clean-csv/clean_comm_use.csv')

In [None]:
df2=pd.read_csv('../input/cord-19-eda-parse-json-and-generate-clean-csv/clean_noncomm_use.csv')

In [None]:
df3=pd.read_csv('../input/cord-19-eda-parse-json-and-generate-clean-csv/clean_pmc.csv')

In [None]:
df.head(10)

In [None]:
df1.head()

In [None]:
df2.head()

In [None]:
df3.head()

In [None]:
df.info()

In [None]:
df1.info()

In [None]:
df2.info()

In [None]:
df3.info()

In [None]:
df.shape

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
df3.shape

In [None]:
df.describe(include='all')

In [None]:
df1.describe(include ='all')

In [None]:
df2.describe(include ='all')

In [None]:
df3.describe(include ='all')

In [None]:
df = df.drop(['paper_id','affiliations','abstract','raw_bibliography','raw_authors'], axis=1)  # drop the columns that are not needed for this analysis
df.head(5)

In [None]:
df1 = df1.drop(['paper_id','affiliations','abstract','raw_bibliography','raw_authors'], axis=1)  # drop the columns that are not needed for this analysis
df1.head(5)

In [None]:
df2 = df2.drop(['paper_id','affiliations','abstract','raw_bibliography','raw_authors'], axis=1)  # drop the columns that are not needed for this analysis
df2.head(5)

In [None]:
df3 = df3.drop(['paper_id','affiliations','abstract','raw_bibliography','raw_authors'], axis=1)  # drop the columns that are not needed for this analysis
df3.head(5)

In [None]:
df.head(10)  # data after dropping the unused columns

In [None]:
df1.head()

In [None]:
df2.head()

In [None]:
df3.head()

In [None]:
df.isnull().sum()

In [None]:
df1.isnull().sum()

In [None]:
df2.isnull().sum()

In [None]:
df3.isnull().sum()

In [None]:
df.shape

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
df3.shape

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

In [None]:
duplicate_rows_df1 = df1[df1.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df1.shape)

In [None]:
duplicate_rows_df2 = df2[df2.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df2.shape)

In [None]:
duplicate_rows_df3 = df3[df3.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df3.shape)

In [None]:
df = df.drop_duplicates()
df.head(5)

In [None]:
df1 = df1.drop_duplicates()
df1.head(5)

In [None]:
df2 = df2.drop_duplicates()
df2.head(5)

In [None]:
df.shape

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
df3.shape

In [None]:
df = df.dropna()
df.count()

In [None]:
df1 = df1.dropna()
df1.count()

In [None]:
df2 = df2.dropna()
df2.count()

In [None]:
df3 = df3.dropna()
df3.count()

In [None]:
df.shape

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
df3.shape

In [None]:
df.isnull().sum() #checking if there are some null values

In [None]:
df1.isnull().sum()

In [None]:
df2.isnull().sum()

In [None]:
df3.isnull().sum()
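
The cleaning steps above (drop unused columns, drop duplicates, drop rows with missing values) are repeated for each of the four DataFrames. A minimal sketch of one way to bundle them is shown below; `clean_frame` and `UNUSED` are hypothetical names introduced here, and the column list is the one used in the `drop` calls above.

```python
import pandas as pd

# Columns dropped in the cells above
UNUSED = ['paper_id', 'affiliations', 'abstract', 'raw_bibliography', 'raw_authors']


def clean_frame(frame, drop_cols):
    """Drop unused columns, then duplicate rows, then rows with missing values."""
    present = [c for c in drop_cols if c in frame.columns]  # ignore columns already removed
    return frame.drop(columns=present).drop_duplicates().dropna()
```

With this helper, the four frames could be cleaned in one line, for example `df, df1, df2, df3 = (clean_frame(f, UNUSED) for f in (df, df1, df2, df3))`.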

In [None]:
import os
for dirname, _, filenames in os.walk('../input/novel-corona-virus-2019-dataset'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df_main = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')

In [None]:
df_main.shape

In [None]:
df_main.head()

In [None]:
df_main.drop(['Lat','Long'], axis=1, inplace=True)  # an in-place drop returns None, so there is nothing to assign

In [None]:
df_main.head()

In [None]:
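# Aggregate by country: groupby makes "Country/Region" the index and sums the rows of each country.
# numeric_only=True skips text columns such as "Province/State".
corona_dataset_aggregated = df_main.groupby("Country/Region").sum(numeric_only=True)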

In [None]:
corona_dataset_aggregated.head()

In [None]:
corona_dataset_aggregated.shape

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# `normalize` was removed from LinearRegression in scikit-learn 1.2
lr = LinearRegression()
from sklearn.metrics import accuracy_score

In [None]:
corona_dataset_aggregated.loc["China"].plot()
corona_dataset_aggregated.loc["Italy"].plot()
corona_dataset_aggregated.loc["Spain"].plot()
plt.legend()

In [None]:
corona_dataset_aggregated.loc['China'].plot()

In [None]:
corona_dataset_aggregated.loc["China"][:3].plot()

In [None]:
corona_dataset_aggregated.loc["China"].diff().plot() #calculating the first derivative of the curve

In [None]:
corona_dataset_aggregated.loc["China"].diff().max()  # find the maximum infection rate (largest daily increase) for China

In [None]:
corona_dataset_aggregated.loc["Italy"].diff().max()

In [None]:
corona_dataset_aggregated.loc["Spain"].diff().max()

In [None]:
# find maximum infection rate for all of the countries
countries =list(corona_dataset_aggregated.index)

max_infection_rates=[]
for c in countries:
     max_infection_rates.append(corona_dataset_aggregated.loc[c].diff().max())
corona_dataset_aggregated["max_infection_rate"]= max_infection_rates
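
The per-country loop above can also be written without an explicit loop. The sketch below, on a small made-up table, takes the day-over-day difference along the date columns and then the row-wise maximum; `max_daily_increase` is a hypothetical helper name. It assumes every column is a cumulative count, so it should be applied before a `max_infection_rate` column is added.

```python
import pandas as pd


def max_daily_increase(cumulative):
    """Largest day-over-day increase for each row of a cumulative-count table."""
    # diff(axis=1) subtracts the previous date column; max(axis=1) skips the leading NaN
    return cumulative.diff(axis=1).max(axis=1)


# Made-up cumulative counts for two regions over three days
demo = pd.DataFrame(
    {"1/22/20": [0, 5], "1/23/20": [3, 6], "1/24/20": [10, 6]},
    index=["A", "B"],
)
rates = max_daily_increase(demo)
```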

In [None]:
corona_dataset_aggregated.head()

In [None]:
#create a new dataframe with only needed column
corona_data = pd.DataFrame(corona_dataset_aggregated["max_infection_rate"])
corona_data.head(10)

In [None]:
# Next task:
# 1. import the World Happiness Report dataset
# 2. select the columns needed for our analysis
# 3. join the datasets
# 4. calculate the correlations as the result of our analysis
happiness_report_csv = pd.read_csv('../input/happiness/worldwide_happiness_report.csv')

In [None]:
happiness_report_csv.head()

In [None]:
happiness_report_csv.shape

In [None]:
#changing the indices of the dataframe
happiness_report_csv.set_index("Country or region",inplace=True)

In [None]:
happiness_report_csv.head()

In [None]:
#now let's join two dataset we have prepared
#Corona Dataset :
corona_data.head()

In [None]:
corona_data.shape

In [None]:
#world happiness report Dataset :
happiness_report_csv.head()

In [None]:
happiness_report_csv.shape

In [None]:
data = corona_data.join(happiness_report_csv,how="inner")
data.head()

In [None]:
# Box plots for the different variables before removing outliers
sns.boxplot(x=data['max_infection_rate'])

In [None]:
sns.boxplot(x=data['GDP per capita'])

In [None]:
sns.boxplot(x=data['Generosity'])

In [None]:
sns.boxplot(x=data['Healthy life expectancy'])

In [None]:
sns.boxplot(x=data['Overall rank'])

In [None]:
#STEP 5 REMOVING OUTLIERS USING IQR METHOD
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
# Remove the outliers, then check the shape of the DataFrame
data = data[~((data < (Q1 - 1.5 * IQR)) |(data > (Q3 + 1.5 * IQR))).any(axis=1)]
data.shape
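
To make the IQR rule used above easier to follow, here is a small self-contained sketch on made-up data. A row is dropped when any numeric value lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The function name `remove_iqr_outliers` and the demo values are assumptions for illustration only.

```python
import pandas as pd


def remove_iqr_outliers(frame, k=1.5):
    """Keep only rows whose numeric values all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return frame[~outside.any(axis=1)]


# Column "a" contains one obvious outlier (100); column "b" contains none
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
kept = remove_iqr_outliers(demo)
```

Because a row is removed if any one column is out of range, this rule can discard many rows when the frame has many columns.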

In [None]:
# Check whether the outliers have been removed
sns.boxplot(x=data['max_infection_rate'])

In [None]:
data= data.reset_index()
data.head()

In [None]:
# STEP 6: percentage of rows per country (the 'index' column holds the country names), top 20 shown
counts = data['index'].value_counts()*100/sum(data['index'].value_counts())
popular_labels = counts.index[:20]
plt.figure(figsize=(10,5))
plt.barh(popular_labels, width=counts[:20])
plt.title('Top 20 countries most represented in the dataset')
plt.show()

In [None]:
# correlation matrix
data.corr(numeric_only=True)  # numeric_only=True skips the text column with the country names

In [None]:
corrMatrix = data.corr(numeric_only=True)
sns.heatmap(corrMatrix,annot=True)

**Documenting insights**
From the heatmap above, max_infection_rate is weakly positively correlated with Healthy life expectancy (+0.23), GDP per capita (+0.21), and Score (+0.20). These three factors are also positively correlated with each other.
Overall rank (-0.19) is weakly negatively correlated with max_infection_rate. The sign is expected, because a smaller rank number goes with a higher Score.

A positive correlation means the two variables tend to increase together, and a negative correlation means one tends to decrease as the other increases. Coefficients of this size indicate only a weak linear association.
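
For reference, the coefficients reported by `data.corr()` are Pearson correlation coefficients. The sketch below computes one from its definition, the covariance of the two variables divided by the product of their standard deviations. `pearson_r` is a hypothetical helper name, and the inputs in the usage note are toy values, not the notebook's data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: sum of centered products over the product of centered norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

For example, `pearson_r([1, 2, 3, 4], [2, 4, 6, 8])` is 1 up to floating-point rounding, because the second list is an exact increasing linear function of the first.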

In [None]:
sns.barplot(x=data['max_infection_rate'], y=data['Healthy life expectancy'])  # recent seaborn versions require x= and y= keywords

In [None]:
# plot further graphs to check how the other variables relate to the pandemic's max_infection_rate (EDA)
# rate vs. GDP
sns.barplot(x=data['max_infection_rate'], y=data['GDP per capita'])

In [None]:
sns.barplot(x=data['max_infection_rate'], y=data['Score'])

In [None]:
sns.barplot(x=data['max_infection_rate'], y=data['Overall rank'])

In [None]:
data['max_infection_rate'].plot.hist()
plt.xlabel('max_infection_rate', fontsize=12)

In [None]:
(data['max_infection_rate'].loc[data['max_infection_rate']<4.223125e+04 ]).plot.hist()

In [None]:
data['Healthy life expectancy'].plot.hist()
plt.xlabel('Life expectancy', fontsize=12)

In [None]:
data['GDP per capita'].plot.hist()
plt.xlabel('GDP', fontsize=12)

In [None]:
data['Score'].plot.hist()
plt.xlabel('Score', fontsize=12)

In [None]:
data['Overall rank'].plot.hist()
plt.xlabel('rank', fontsize=12)

In [None]:
fig, ax=plt.subplots(figsize=(5,5))
ax.scatter(data['max_infection_rate'],data['Healthy life expectancy'])
plt.title('Scatter between life_expectancy and max_infection_rate')
# set_xlabel/set_ylabel are methods: call them instead of assigning to them
ax.set_xlabel('max_infection_rate')
ax.set_ylabel('life_expectancy')
plt.show()

In [None]:
fig, ax=plt.subplots(figsize=(5,5))
ax.scatter(data['max_infection_rate'],data['GDP per capita'])
plt.title('Scatter between GDP and max_infection_rate')
ax.set_xlabel('max_infection_rate')
ax.set_ylabel('GDP')
plt.show()

In [None]:
fig, ax=plt.subplots(figsize=(5,5))
ax.scatter(data['max_infection_rate'],data['Overall rank'])
plt.title('Scatter between Rank and max_infection_rate')
ax.set_xlabel('max_infection_rate')
ax.set_ylabel('Rank')
plt.show()

In [None]:
x = data['Healthy life expectancy']
y = data["max_infection_rate"]
sns.scatterplot(x=x, y=np.log(y))  # log scale on the infection rate; x= and y= must be keywords in recent seaborn

In [None]:
sns.regplot(x=x, y=np.log(y))

In [None]:
x = data['GDP per capita']
y = data["max_infection_rate"]
sns.scatterplot(x=x, y=np.log(y))

In [None]:
sns.regplot(x=x, y=np.log(y))

In [None]:
x = data['Overall rank']
y = data["max_infection_rate"]
sns.scatterplot(x=x, y=np.log(y))

In [None]:
sns.regplot(x=x, y=np.log(y))

Splitting the data set 80:20, following the Pareto principle

In [None]:
pm = data.select_dtypes(exclude=[np.number]) 
pm

In [None]:
# 9.2 LabelEncoder encodes target labels with values between 0 and n_classes-1.
# This transformer is meant for target values (y), not the input X.
# It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
# 9.3 Preprocessing the data before splitting
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()
for i in pm:
  data[i] = label_enc.fit_transform(data[i])
print('Label Encoded Data')
data.head()

In [None]:
from sklearn import preprocessing
X = np.asarray(data[['Healthy life expectancy','GDP per capita','Score','Overall rank']])
y = np.asarray(data['max_infection_rate'])
X = preprocessing.StandardScaler().fit(X).transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
# 9.4 Linear regression model with max_infection_rate as the target variable
# We use scikit-learn's LinearRegression to fit a regression line on the training set and predict on the test set

from sklearn import linear_model
lm = linear_model.LinearRegression()
model = lm.fit(X_train,y_train)
predictions = lm.predict(X_test)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train,y_train)
y_pred = model.predict(X_test)
model.score(X_test, y_test)  # R^2 against the true test targets; scoring against y_pred always gives 1.0

In [None]:
# max_infection_rate is a continuous target, so a regression tree is used here in place of
# DecisionTreeClassifier; accuracy and a confusion matrix only apply to class labels
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
d_m = DecisionTreeRegressor(random_state = 0)
d_m.fit(X_train,y_train)
y_pred = d_m.predict(X_test)
print("MSE : ", mean_squared_error(y_test, y_pred))
print("\nR squared : ", r2_score(y_test, y_pred))

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
model.predict(X_test)

In [None]:

y_pred = model.predict(X_test) 
plt.plot(y_test, y_pred, '.')

# plot the line y = x; a perfect prediction would fall on this line
x = np.linspace(y_test.min(), y_test.max(), 100)  # span the observed range of the test targets
y = x
plt.plot(x, y)
plt.show()


In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import math

print('MSE: %.2f' % mean_squared_error(y_test, y_pred))
print('R Squared : %.2f' % r2_score(y_test, y_pred))
print('MAE :%.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE : %.2f' % math.sqrt(mean_squared_error(y_test, y_pred)))
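
As a cross-check on the scikit-learn metrics printed above, the sketch below computes the same four quantities directly from the residuals with NumPy. `regression_metrics` is a hypothetical helper name, and the toy inputs in the usage note are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R squared computed from the residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": mae, "R2": 1 - ss_res / ss_tot}
```

For instance, with `y_true = [1, 2, 3]` and `y_pred = [1, 2, 4]` the residual sum of squares is 1 and the total sum of squares is 2, so R squared is 0.5.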

In [None]:
#multiple regression

def abss(x):
    if x>=0: return x
    else: return -1*x

In [None]:
print(data.head())
headers=data.dtypes.index
header=headers.tolist()
features=[data['Healthy life expectancy'],data['GDP per capita']]
Y=np.array(data['max_infection_rate'])

In [None]:
X=[]
depth=len(Y)
width=len(features)
for i in range(depth):
    X.append([1])
    for j in range(width):
        X[i].append(features[j][i])
    #X[i].append(Y[i])

In [None]:
X

In [None]:
X=np.matrix(X)
Y=np.matrix(Y).T  # np.mat was removed in NumPy 2.0; np.matrix gives the same column vector
print(X)
#XT_X=np.dot(X.T,X)
#XT_Y=np.dot(X.T,Y)

In [None]:
X.T

In [None]:
inverse=np.dot(X.T,X).I
intermediate_rep=np.dot(inverse,X.T)
beta=np.dot(intermediate_rep,Y)
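
The cell above solves the normal equation, beta = (X^T X)^(-1) X^T Y, with an explicit matrix inverse. As an alternative sketch, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse, which tends to be more stable when the columns are nearly collinear. `fit_ols` is a hypothetical helper name, and the data in the usage note is a toy example.

```python
import numpy as np


def fit_ols(features, target):
    """Least-squares coefficients [intercept, slope_1, ...] for target ~ features."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta
```

On exact data such as `fit_ols([[0], [1], [2], [3]], [1, 3, 5, 7])`, the result is approximately `[1, 2]`, the intercept and slope of y = 1 + 2x.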

In [None]:
beta

In [None]:
new_y=np.dot(X,beta)

In [None]:
#print(new_y)
error=Y-new_y
print(error)

In [None]:
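# count a prediction as "correct" when its absolute error is at most 5
correct=0
for val in error:
    if abs(val.item())<=5:   # without abs(), large negative errors would also count as correct
        correct+=1
accuracy=correct/float(len(error))
print(accuracy)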

In [None]:
plt.plot(np.array(X[:,1]),np.array(new_y))

In [None]:
plt.plot(np.array(X[:,1]),np.array(Y-new_y))

In [None]:
X[:,1]

In [None]:
print(error)
# square every residual element-wise so that the next cells sum the squared errors
error = np.square(np.asarray(error))

In [None]:
error

In [None]:
sum(error)

In [None]:
s=0
for val in error:
    s+=val
print(s)

In [None]:
for i in range(30):
    print(Y[i],new_y[i], Y[i]-new_y[i])

In [None]:
new_y[0]

In [None]:
#Logistic regression
z=np.arange(-6,6,0.1)
sigmoid=1/(1+np.exp(-z))
fig=plt.figure('cost function')
plt.plot(z,sigmoid)
plt.grid(True)
plt.xlabel('input z')
plt.ylabel('output')
plt.title('sigmoid function graph')
plt.show()

In [None]:
t=np.ones((2,3))
t.shape

In [None]:
def hypothesis(theta, X, n):
    h = np.ones((X.shape[0],1))
    theta = theta.reshape(1,n+1)
    for i in range(0,X.shape[0]):
        h[i] = 1 / (1 + np.exp(-float(np.matmul(theta, X[i]))))
    h = h.reshape(X.shape[0])
    return h

In [None]:
def BGD(theta, alpha, num_iters, h, X, y, n):
    theta_history = np.ones((num_iters,n+1))
    cost = np.ones(num_iters)
    for i in range(0,num_iters):
        theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
        for j in range(1,n+1):
            theta[j]=theta[j]-(alpha/X.shape[0])*sum((h-y)
                               *X.transpose()[j])
        theta_history[i] = theta
        h = hypothesis(theta, X, n)
        cost[i]=(-1/X.shape[0])*sum(y*np.log(h)+(1-y)*np.log(1 - h))
    theta = theta.reshape(1,n+1)
    return theta, theta_history, cost

In [None]:
def logistic_regression(X, y, alpha, num_iters):
    n = X.shape[1]
    one_column = np.ones((X.shape[0],1))
    X = np.concatenate((one_column, X), axis = 1)
    # initializing the parameter vector...
    theta = np.zeros(n+1)
    # hypothesis calculation....
    h = hypothesis(theta, X, n)
    # returning the optimized parameters by Gradient Descent...
    theta,theta_history,cost = BGD(theta,alpha,num_iters,h,X,y,n)
    return theta, theta_history, cost
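
The three functions above update each parameter in a Python loop. The sketch below shows the same batch gradient descent idea in vectorized form, where one matrix product updates all parameters at once. It is not a drop-in replacement: the names `sigmoid` and `fit_logistic`, the default `alpha` and `num_iters`, and the small `eps` added inside the logarithms are assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression; returns (theta, cost history)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.column_stack([np.ones(len(X)), X])   # add the intercept column
    theta = np.zeros(Xb.shape[1])
    cost = np.empty(num_iters)
    eps = 1e-12                                  # keeps log() away from zero
    for i in range(num_iters):
        h = sigmoid(Xb @ theta)
        theta -= (alpha / len(y)) * (Xb.T @ (h - y))   # gradient step on all parameters
        h = sigmoid(Xb @ theta)
        cost[i] = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return theta, cost
```

The labels passed as `y` must be 0 or 1; a continuous target such as `max_infection_rate` would first have to be converted into classes.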

In [None]:
X_train = X[:,[1,2]] # feature-set
y_train = new_y # label-set

In [None]:
# Assumes y_train holds binary labels (0 or 1); the continuous new_y assigned above has to be
# converted into classes before this cell produces a meaningful plot.
x0 = np.ones((np.array([x for x in y_train if x == 0]).shape[0], 
              X_train.shape[1]))
x1 = np.ones((np.array([x for x in y_train if x == 1]).shape[0], 
              X_train.shape[1]))
#x0 and x1 are matrices containing the class-0 and class-1 examples from the
#dataset, initialized to 1
k0 = k1 = 0
for i in range(0,y_train.shape[0]):
    if y_train[i] == 0:
        x0[k0] = X_train[i]
        k0 = k0 + 1
    elif y_train[i] == 1:   # match the x == 1 filter used to size x1
        x1[k1] = X_train[i]
        k1 = k1 + 1
X = [x0, x1]
colors = ["green", "blue"] # 2 distinct colours for 2 classes 
import matplotlib.pyplot as plt
for x, c in zip(X, colors):
    if c == "green":
        plt.scatter(x[:,0],x[:,1],color = c,label = "Class 0")
    else:
        plt.scatter(x[:,0], x[:,1], color = c, label = "Class 1")
# X_train holds columns 1 and 2 of the design matrix: Healthy life expectancy and GDP per capita
plt.xlabel("Healthy life expectancy")
plt.ylabel("GDP per capita")
plt.legend()

In [None]:
data.replace('?',np.nan,inplace=True)

#delete all the rows that have NaN in them
dk=data.dropna()

#deleting the column with id, 1 in the argument indicates 'column', so this will delete the 'column' containing 'id'
#df.drop(['id'],1,inplace=True)
full_data=data.values.tolist()
headers = data.dtypes.index
header=headers.tolist()

# keep both arrays as floats: casting to int64 would truncate any fractional values
hle=np.array(data['Healthy life expectancy'], dtype=np.float64)
mir=np.array(data['max_infection_rate'], dtype=np.float64)
xs=hle
ys=mir
print(data['Healthy life expectancy'])

In [None]:
def shift_origin(xs,ys):
    x_mean=np.mean(xs)
    y_mean=np.mean(ys)
    shifted_x=xs-x_mean
    shifted_y=ys-y_mean
    #plt.scatter(shifted_x,shifted_y)
    return shifted_x,shifted_y,x_mean,y_mean

In [None]:
shifted_x,shifted_y,x_mean,y_mean=shift_origin(xs,ys)

In [None]:
def sum_squared_mean(xs,ys):
    m=0     #initial slope
    m_list=[]    #list of all slopes throughout the program
    learning=0.05  #initial learning rate
    squared_error=[]  #squared errors of the steps kept so far
    squared_error_list=[] #list of all squared errors

    #shift of mean of the passed dataset
    shifted_x,shifted_y,x_mean,y_mean=shift_origin(xs,ys)

    #limit for the minimum learning rate
    while learning>0.000001:

        #calculate the squared error (named sse to avoid shadowing the built-in sum)
        sse=0
        for i in range(len(xs)):   #valid indices are 0..len(xs)-1
            sse+=(m*shifted_x[i]-shifted_y[i])**2

        #append error and slope in lists
        squared_error_list.append(sse)
        m_list.append(m)

        #stop if the error is 0 (perfect slope) or equal to the previous error
        if sse==0 or (squared_error and sse==squared_error[-1]):
            break

        #if new error is greater than previous error, take 2 steps back and reduce the learning rate
        #also remove the squared errors for the last 2 steps
        if squared_error and squared_error[-1]<sse:
            m-=learning*2
            learning=learning/float(10)
            del(squared_error[-2:])
        squared_error.append(sse)

        #increment the slope by learning rate
        m+=learning

    #always return the same 3-tuple so the caller can unpack it
    return m,m_list,squared_error_list

In [None]:
m,m_list,squared_error_list=sum_squared_mean(np.abs(xs),np.abs(ys))

#best slope returned by the function
m
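
For comparison with the step-search above, the best slope of a line through the origin of mean-centered data has a closed form: m = sum(x_i * y_i) / sum(x_i^2), where x_i and y_i are the centered values. The sketch below computes it directly; `centered_slope` is a hypothetical helper name and the usage values are toy data.

```python
import numpy as np


def centered_slope(xs, ys):
    """Least-squares slope after shifting both variables to zero mean."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    sx = xs - xs.mean()
    sy = ys - ys.mean()
    return float(np.sum(sx * sy) / np.sum(sx ** 2))
```

For points on the line y = 2x + 1, such as `centered_slope([1, 2, 3, 4], [3, 5, 7, 9])`, this gives a slope of 2, which the iterative search should approach as the learning rate shrinks.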

In [None]:
# Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report 
confusion_matrix(y_test, y_pred)  # scikit-learn expects (y_true, y_pred) in this order

In [None]:
y_pred = svc.predict(X_test)  # predict on the held-out features (same columns as the X_train used in fit), not on random inputs

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
#Classification Report
#from sklearn.metrics import classification_report

In [None]:
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)


In [None]:
y_pred = knn.predict(X_test)  # class labels; predict_proba returns per-class probabilities, which the reports below cannot score
knn.score(X_test, y_test)


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
# K Means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)
y_pred = k_means.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
pca_model = pca.fit_transform(X_train)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)