# Tutorial: Your Data Science Workflow w/ Python

**Author: Nikhil Praveen (@nxrprime)** <br>
*3/25/2020*

---

# Table of contents

---

I. **Introduction**
    + Our workflow
    + Explanation    
        + Why a workflow?
II. **Part 1. EDA**
    + Statistical Approaches
    + Getting started
        + Loading libraries
        + Background on the data
        +  Loading data
        + Preprocessing data
        + Inspecting our data
    + Working with our data
III. **Part 2. Feature Engineering**
    + What is feature engineering?
    + Working with FE
    + Creating some more features
IV. **Part 3. Modeling**
    + What is modeling?
    + Our approaches
        + Decision Trees
            + Decision Trees Explained
        + k-Nearest Neighbours
            + kNN Classification Explained
            + kNN Regression Explained
        + Regression
            + Logistic Regression
            + Lasso regression
            
V. **Part 4. Ensemble**
    + Simple blend
    + Weighted average
    + Majority voting

# I. Introduction

---

## Our workflow

![](https://www.dataquest.io/wp-content/uploads/2019/05/what-is-data-science-workflow.jpg)

*Image credits: Dataquest*

The workflow we are using is simple and brief. The ensembling step that I have included is not always necessary, but it helps with certain datasets. In Kaggle competitions, ensembling has proven helpful time and time again.

## Explanation

### Why a workflow?

A workflow helps you to:
* Structure your projects
* Solve problems efficiently
* Take a route through the problem that leads you to the best solution

# II. **Part 1 - EDA**

---

![](https://www.jeannjoroge.com/images/edamethods.png)

---

EDA stands for:
* E- Exploratory
* D- Data
* A- Analysis

## Statistical approaches

In an EDA, we can use several statistical measures to get a **concise, boiled-down** view of the data. After all, data science is fundamentally statistics in the end.
    
<i>Note: The following equations are taken from Wikipedia since I am not very adept at using LaTeX</i>

For example, if someone wanted to use the Pearson correlation coefficient in an EDA, the definition would look like this.

Suppose that we have two variables of interest, denoted as $X$ and $Y$, and suppose that we have a bivariate sample of size $n$:

$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$
 
and we define the following:

$\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_i , \quad S_{XX}=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$ <br>
$\bar{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_i , \quad S_{YY}=\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar{Y})^2$ <br>
$S_{XY}=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})$ <br>

The statistics above represent:
* $\bar{X}$ and $\bar{Y}$: the sample means of X and Y
* $S_{XX}$ and $S_{YY}$: the sample variances of X and Y
* $S_{XY}$: the sample covariance of X and Y

The sample Pearson correlation coefficient (also called the sample product-moment correlation coefficient) for measuring the association between variables X and Y is given by the following formula:

$r_p=\frac{S_{XY}}{\sqrt{S_{XX}S_{YY}}}$
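
The formula can be checked numerically. Below is a minimal sketch that computes $r_p$ term by term with NumPy and compares it against `np.corrcoef`. The function name `pearson_r` and the sample values are made up for illustration and are not part of the COVID-19 dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation, following the definitions above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()                    # sample means
    s_xx = np.sum((x - x_bar) ** 2) / (n - 1)            # sample variance of X
    s_yy = np.sum((y - y_bar) ** 2) / (n - 1)            # sample variance of Y
    s_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)   # sample covariance
    return s_xy / np.sqrt(s_xx * s_yy)

# Hypothetical sample values, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # NumPy's built-in, for comparison
print(r, r_numpy)
```

Note that the $\frac{1}{n-1}$ factors cancel in the ratio, so the result is the same whether the sample or population versions of the variance and covariance are used.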

## Getting started

We shall use the [coronavirus (COVID-19) dataset by SRK](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset) for our exploratory data analysis. 

### Loading libraries

**These are the libraries we shall load and their functions:**

+ Pandas: CSV I/O, dataframe operations
+ NumPy: Linear algebra and array/matrix manipulations
+ Matplotlib: Plotting charts
+ Seaborn: Statistical charts
+ Plotly: Interactivity in plots


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.graph_objects as go
import plotly.express as px

### Background on the disease

The 2019 Novel Coronavirus, or 2019-nCoV, is a new respiratory virus first identified in Wuhan, Hubei Province, China. A novel coronavirus (nCoV) is a new coronavirus that has not been previously identified. The 2019 novel coronavirus (2019-nCoV) is not the same as the coronaviruses that commonly circulate among humans and cause mild illness, like the common cold.

A diagnosis with coronavirus 229E, NL63, OC43, or HKU1 is not the same as a 2019-nCoV diagnosis. These are different viruses and patients with 2019-nCoV will be evaluated and cared for differently than patients with common coronavirus diagnosis.

Public health officials and partners are working hard to identify the source of the 2019-nCoV. Coronaviruses are a large family of viruses, some causing illness in people and others that circulate among animals, including camels, cats and bats. Analysis of the genetic tree of this virus is ongoing to know the specific source of the virus. SARS, another coronavirus that emerged to infect people, came from civet cats, while MERS, another coronavirus that emerged to infect people, came from camels. More information about the source and spread of 2019-nCoV is available on the 2019-nCoV Situation Summary: Source and Spread of the Virus.

This virus probably originally emerged from an animal source but now seems to be spreading from person-to-person. It’s important to note that person-to-person spread can happen on a continuum. Some viruses are highly contagious (like measles), while other viruses are less so. At this time, it’s unclear how easily or sustainably this virus is spreading between people.

Reference: https://www.cdc.gov/coronavirus/2019-ncov/faq.html

### CSV operations

CSV stands for:

+ C - Comma
+ S - Separated
+ V - Values

CSV files are a very common file type in data science.

We will use the Pandas function `read_csv()` to read in CSV files and load them as DataFrames in our workspace.

In [None]:
# Main table; parse the 'Last Update' column as datetimes while reading
df = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv', parse_dates=['Last Update'])
df.rename(columns={'ObservationDate': 'Date', 'Country/Region': 'Country'}, inplace=True)

# Time-series tables for confirmed cases, recoveries and deaths
conf = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv")
recov = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_recovered.csv")
death = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv")
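
If you are not running on Kaggle and do not have the files at those paths, the same `read_csv()` pattern can be tried on an in-memory CSV. The sketch below reuses the column names from the cell above; the two data rows are invented for illustration and are not real figures from the dataset.

```python
import io
import pandas as pd

# Tiny, made-up CSV using the same column names as covid_19_data.csv
csv_text = """ObservationDate,Country/Region,Last Update,Confirmed
01/22/2020,Mainland China,1/22/2020 17:00,1
01/23/2020,Mainland China,1/23/2020 17:00,3
"""

# read_csv accepts any file-like object, not just a path
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Last Update'])
sample = sample.rename(columns={'ObservationDate': 'Date', 'Country/Region': 'Country'})

print(sample.dtypes)  # 'Last Update' should now be a datetime column
```

Assigning the result of `rename()` back to the variable, as here, is equivalent to passing `inplace=True`.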

### Preprocessing data

Preprocessing allows us to change the data, e.g. drop some columns, rename columns, or extract some columns.

A handy template for renaming columns is:

`in_df = in_df.rename(columns = {'SomeColumn': 'SomeOtherColumn'})`

In [None]:
conf.rename(columns={'Country/Region':'Country'}, inplace=True)
recov.rename(columns={'Country/Region':'Country'}, inplace=True)
death.rename(columns={'Country/Region':'Country'}, inplace=True)
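
The other preprocessing operations mentioned above (dropping and extracting columns) follow the same pattern. Here is a minimal sketch on a toy DataFrame. The column names `Province/State`, `Lat`, `Long` and `1/22/20` and all values are assumptions chosen for illustration, so check them against `conf.columns` before applying this to the real tables.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical
toy = pd.DataFrame({
    'Province/State': ['A', 'B'],
    'Country/Region': ['X', 'Y'],
    'Lat': [1.0, 2.0],
    'Long': [3.0, 4.0],
    '1/22/20': [0, 5],
})

toy = toy.rename(columns={'Country/Region': 'Country'})  # rename a column
toy = toy.drop(columns=['Lat', 'Long'])                  # drop columns we do not need
by_country = toy[['Country', '1/22/20']]                 # extract a subset of columns

print(by_country)
```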

### Inspecting data

We can use the Pandas methods `df.head()` and `df.tail()` to preview the first and last rows of a DataFrame.

In [None]:
df.head()

In [None]:
conf.head()

In [None]:
death.head()

In [None]:
recov.head()
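
Besides `head()` and `tail()`, a few other standard Pandas calls are useful for a first look at a table. The sketch below runs them on a small made-up DataFrame (the values are not from the dataset); the same calls work on `df`, `conf`, `death` and `recov`.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'Country': list('ABCDEFG'), 'Confirmed': range(7)})

first_rows = toy.head()   # first 5 rows by default
last_two = toy.tail(2)    # last 2 rows

print(toy.shape)          # (number of rows, number of columns)
toy.info()                # column dtypes and non-null counts
print(toy.describe())     # summary statistics for numeric columns
```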

## Working with the data

Let's use an interactive bar plot to see the confirmed cases over time:

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x=df['Date'],
                y=df['Confirmed'],
                name='Confirmed',
                marker_color='blue'
                ))

fig.update_layout(
    title='Worldwide Corona Virus Cases - Confirmed (Bar Chart)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # Nested title dict; the older `titlefont_size` shortcut is deprecated
        title=dict(text='Number of Cases', font=dict(size=16)),
        tickfont_size=14,
    )
)
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x=df['Date'],
                y=df['Deaths'],
                name='Deaths',
                marker_color='red'
                ))

fig.update_layout(
    title='Worldwide Corona Virus Cases - Deaths (Bar Chart)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        # Nested title dict; the older `titlefont_size` shortcut is deprecated
        title=dict(text='Number of Cases', font=dict(size=16)),
        tickfont_size=14,
    )
)
fig.show()
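
The charts above pass every row of `df` to `go.Bar`. If the table holds several rows per date (for example, one per country or province, which is an assumption here since the output of `df.head()` is not shown), it may be clearer to sum the counts per date first. A minimal sketch of that aggregation on made-up data:

```python
import pandas as pd

# Made-up rows: two regions reported on each of two dates
toy = pd.DataFrame({
    'Date': ['01/22/2020', '01/22/2020', '01/23/2020', '01/23/2020'],
    'Country': ['X', 'Y', 'X', 'Y'],
    'Confirmed': [1, 2, 3, 5],
    'Deaths': [0, 0, 1, 1],
})

# One row per date, with counts summed across regions
daily = toy.groupby('Date', as_index=False)[['Confirmed', 'Deaths']].sum()
print(daily)
```

`daily['Date']` and `daily['Confirmed']` could then be passed as `x` and `y` to `go.Bar` in place of the raw columns.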

---

## This is a work in progress. If it has helped you, please consider upvoting.

---