<a href="https://colab.research.google.com/github/xalejandrow/hypothesis-testing-exercises-project-with-python/blob/main/Framework_KNN_4GKS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Science: Concept, Framework

### 1. Basic concept
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge [1]. It is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data, and to apply that knowledge across a broad range of application domains [2].

### 2. A proposed framework


<img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" />

Figure 1. A typical data science project (From R for Data Science, H. Wickham et al. https://r4ds.had.co.nz/introduction.html)

- Import your data: This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame. No data, no data science.

- Tidy it: Tidying means storing the data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.

- Transform it: Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).

- Visualise it: Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or that you need to collect different data. Visualisations can surprise you, but they don’t scale particularly well because they require a human to interpret them.

- Model it: Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

- Communicate it: Communication is an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualisations have led you to understand the data unless you can also communicate your results to others.

Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of the project. You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
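The tidy and transform steps described above can be sketched with pandas. This is only an illustration under stated assumptions: the `wide` table, its runner names, and its column names are hypothetical and are not part of the Iris example used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical untidy data: one row per runner, one column per race distance.
wide = pd.DataFrame({
    "runner": ["ana", "ben"],
    "time_5km_min": [25.0, 30.0],
    "time_10km_min": [55.0, 60.0],
})

# Tidy: each row becomes one observation (runner, race), each column one variable.
tidy = wide.melt(id_vars="runner", var_name="race", value_name="time_min")
tidy["distance_km"] = tidy["race"].str.extract(r"(\d+)km", expand=False).astype(float)
tidy = tidy.drop(columns="race")

# Transform: create a new variable (speed) as a function of existing ones.
tidy["speed_kmh"] = tidy["distance_km"] / (tidy["time_min"] / 60)

# Transform: narrow in on observations of interest.
long_races = tidy[tidy["distance_km"] >= 10]

# Transform: compute summary statistics (counts and means) per runner.
summary = tidy.groupby("runner")["speed_kmh"].agg(["count", "mean"])
print(summary)
```

The same three transformations (filter, derive, summarise) apply to any tidy data frame, including `df_iris` below.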

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [None]:
iris = load_iris()

In [None]:
# Build a data frame with one column per measurement
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map the integer targets (0, 1, 2) to species names as a categorical column
df_iris['Species'] = pd.Categorical.from_codes(iris.target, categories=iris.target_names)

In [None]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   sepal length (cm)  150 non-null    float64 
 1   sepal width (cm)   150 non-null    float64 
 2   petal length (cm)  150 non-null    float64 
 3   petal width (cm)   150 non-null    float64 
 4   Species            150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.1 KB


In [None]:
df_iris.sample(10)

index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Species
30,4.8,3.1,1.6,0.2,setosa
67,5.8,2.7,4.1,1.0,versicolor
95,5.7,3.0,4.2,1.2,versicolor
46,5.1,3.8,1.6,0.2,setosa
86,6.7,3.1,4.7,1.5,versicolor
9,4.9,3.1,1.5,0.1,setosa
4,5.0,3.6,1.4,0.2,setosa
120,6.9,3.2,5.7,2.3,virginica
82,5.8,2.7,3.9,1.2,versicolor
89,5.5,2.5,4.0,1.3,versicolor


In [None]:
df_iris.describe()

statistic,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [None]:
fig = px.scatter_matrix(df_iris, dimensions=['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)'], color='Species')
fig.show()

In [None]:
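# Split the data into training and test sets, then fit the scaler on the
# training set only so that no test-set statistics leak into preprocessing
standardizer = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train_scaled = standardizer.fit_transform(X_train)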

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

KNeighborsClassifier(n_neighbors=3)

In [None]:
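# Inspect the first test sample. A notebook cell displays only its last
# expression, so only the true label y_test[0] appears in the output.
X_test[0]
y_test[0]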

2

In [None]:
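# Scale the first test sample with the mean and standard deviation learned
# from the training set; reshape it to a 2D array of shape (1, 4) first
X_test_scaled = standardizer.transform(X_test[0].reshape(-1,4))
X_test_scaled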

array([[-0.09984503, -0.57982483,  0.72717965,  1.51271377]])
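To see where these scaled values come from, the sketch below reproduces the transformation by hand. It assumes the same split as above (`random_state=0`) and relies on `StandardScaler` computing z = (x - mean) / std per feature, using the training-set mean and the population standard deviation. The name `manual` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Learn the per-feature mean and standard deviation from the training set only
standardizer = StandardScaler().fit(X_train)

# Manual z-score of the first test sample, using training-set statistics
# (np.std defaults to the population standard deviation, ddof=0)
manual = (X_test[0] - X_train.mean(axis=0)) / X_train.std(axis=0)

# The same sample transformed by the fitted scaler
scaled = standardizer.transform(X_test[0].reshape(1, -1))[0]

print(np.allclose(manual, scaled))
```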

In [None]:
prediction = knn.predict(X_test_scaled)
prediction

array([2])
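The cells above classify a single test sample. As a possible next step, the sketch below scores the whole test set, assuming the same split and `n_neighbors=3`. Wrapping the scaler and classifier in `make_pipeline` is a choice made here for convenience and is not part of the original notebook; it applies the same scaling to every sample before prediction.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Chain scaling and classification so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

# Predict every test sample and compute the fraction classified correctly
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)

# Show the first few predictions as species names rather than integer codes
print(iris.target_names[predictions[:5]])
print(f"Test accuracy: {accuracy:.3f}")
```

A single held-out split gives only a rough estimate of accuracy; it varies with `random_state` and the choice of `n_neighbors`.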