<center><h1>Introduction to the Data Science Toolbox</h1></center>

## Table of contents

* [SciPy](#tbx_scipy) [Math & Stat]
* [StatsModels](#tbx_stats) [Math & Stat]
* [Numpy](#tbx_numpy) [Math & Stat]
* [Pandas](#tbx_pandas) [Data Processing]
* [Scikit-learn](#tbx_sklearn) [Machine Learning]
* [Keras](#tbx_keras) [Deep Learning]
* [Matplotlib, Seaborn](#tbx_vis) [Visualization]


<a id='tbx_scipy'></a>
## SciPy

<img src="imgs/scipy.png" alt="drawing" width="200">

**The SciPy ecosystem**

Scientific computing in Python builds upon a small core of packages:

**Python**, a general-purpose programming language. It is interpreted and dynamically typed, well suited to interactive work and quick prototyping, and powerful enough for writing large applications.

**NumPy**, the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.

The **SciPy** library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more.

**Matplotlib**, a mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting.

* Website: https://www.scipy.org/
* Scipy Lecture Notes: https://scipy-lectures.org/

```python
import numpy as np
```

<a id='tbx_numpy'></a>
## Numpy

<img src="imgs/numpy.png" alt="drawing" width="400">

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

* a powerful N-dimensional array object
* sophisticated (broadcasting) functions
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

* Website: https://www.numpy.org/
* Numpy Tutorial: http://cs231n.github.io/python-numpy-tutorial/
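
As a small illustration of the N-dimensional array and broadcasting mentioned above, the sketch below centers each column of an array on its mean. The numbers are made up for the example.

```python
import numpy as np

# A 3x2 array: three observations of two features (made-up values).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# Column means have shape (2,); broadcasting stretches them across all rows,
# so no explicit loop is needed to subtract them.
col_means = data.mean(axis=0)
centered = data - col_means

print(col_means)
print(centered)
```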

<a id='tbx_pandas'></a>
## Pandas

<img src="imgs/pandas.png" alt="drawing" width="500">

**pandas** is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

* Website: https://pandas.pydata.org/
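
A minimal sketch of the kind of data structure and analysis pandas offers: a `DataFrame` built from a dictionary and summarized per group. The column names and values are hypothetical.

```python
import pandas as pd

# A tiny table of made-up sales records.
df = pd.DataFrame({
    "city": ["Dhaka", "Dhaka", "Khulna", "Khulna"],
    "sales": [100, 150, 80, 120],
})

# Average sales per city, computed with a group-by.
mean_sales = df.groupby("city")["sales"].mean()
print(mean_sales)
```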

<a id='tbx_sklearn'></a>
## Scikit-learn

<img src="imgs/sklearn.png" alt="drawing" width="300">

Scikit-learn enables you to do Machine Learning in Python.

* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

Website: https://scikit-learn.org/stable/

<a id='tbx_keras'></a>
## Keras

<img src="imgs/keras.png" alt="drawing" width="400">

**Keras** is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Use Keras if you need a deep learning library that:

* Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
* Supports both convolutional networks and recurrent networks, as well as combinations of the two.
* Runs seamlessly on CPU and GPU.

Website: https://keras.io/

<a id='tbx_vis'></a>
## Matplotlib, Seaborn

<img src="imgs/matplotlib.png" alt="drawing" width="400">

**Matplotlib** is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

**Seaborn** is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

* Website: https://matplotlib.org/
* Website: https://seaborn.pydata.org/

<center><h1>Diving into Data Science</h1></center>

## Table of contents
* [Data Science Pipeline](#pipeline)
* [Data Collection](#collection)
* [Data Exploration](#exploration)
* [Data Preprocessing](#preprocessing)
* [Data Modeling](#modeling)
* [Model Validation](#validation)
* [Communication](#communication)

<a id='pipeline'></a>
## Data Science Pipeline

<img src="imgs/data_pipeline.png" alt="drawing" width="1200">

<a id='collection'></a>
## 1. Data Collection

<center><h3>90% OF THE WORLD’S DATA WAS CREATED IN THE LAST 2 YEARS</h3></center>

**Definition:** Data collection is a process of collecting information from all the relevant sources to find answers to the research problem, test the hypothesis and evaluate the outcomes.

Information you gather can come from a range of sources. Likewise, there are a variety of techniques to use when gathering primary data. Listed below are some of the most common data collection techniques. [**Source: <a href="https://cyfar.org/data-collection-techniques">cyfar</a>**]

<h3 style="color:navy;display:inline;">Technique: <span style="color:black">Interview</span></h3> 

**Key Facts:**
* Interviews can be conducted in person or over the telephone
* Interviews can be done formally (structured), semi-structured, or informally
* Questions should be focused, clear, and encourage open-ended responses
* Interviews are mainly qualitative in nature

**Example:** One-on-one conversation with a parent of an at-risk youth who can help you understand the issue.

<h3 style="color:navy;display:inline;">Technique: <span style="color:black"> Questionnaires and Surveys</span></h3> 

**Key Facts:**
* Responses can be analyzed with quantitative methods by assigning numerical values to Likert-type scales
* Results are generally easier (than qualitative techniques) to analyze
* Pretest/Posttest can be compared and analyzed

**Example:** Results of a satisfaction survey or opinion survey.

<h3 style="color:navy;display:inline;">Technique: <span style="color:black"> Observations</span></h3> 

**Key Facts:**
* Allows for the study of the dynamics of a situation, frequency counts of target behaviors, or other behaviors as indicated by needs of the evaluation.
* Good source of additional information about a particular group; video can be used to provide documentation.
* Can produce qualitative (e.g., narrative data) and quantitative data (e.g., frequency counts, mean length of interactions, and instructional time).

**Example:** Site visits to an after-school program to document the interaction between youth and staff within the program.

<h3 style="color:navy;display:inline;">Technique: <span style="color:black"> Focus Groups</span></h3> 

**Key Facts:**
* A facilitated group interview with individuals that have something in common.
* Gathers information about combined perspectives and opinions.
* Responses are often coded into categories and analyzed thematically.

**Example:** A group of parents of teenagers in an after-school program are invited to informally discuss programs that might benefit and help their children succeed.

<h3 style="color:navy;display:inline;">Technique: <span style="color:black"> Documents and Records</span></h3> 

**Key Facts:**
* Consists of examining existing data in the form of databases, meeting minutes, reports, attendance logs, financial records, medical records, newsletters, PDFs etc.
* This can be an inexpensive way to gather information but may be an incomplete data source

**Example:** To understand the primary reasons students miss school, records on student absences are collected and analyzed.

<h3 style="color:navy;display:inline;">Technique: <span style="color:black"> Web Scraping and Web APIs</span></h3> 

**Key Facts:**
* Involves scraping and extracting data (text, tables, images, etc.) from different websites, or using APIs such as the Facebook Graph API or the Twitter API.
* This method requires web programming knowledge.

**Example:** To understand the user sentiment of Facebook online shop pages, comments are collected from the posts. 

**Selenium and Beautiful Soup are two popular web scraping tools.**
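
To give a feel for what extracting text from a page involves, here is a minimal sketch that uses only Python's standard-library `html.parser` as a stand-in; a dedicated scraping library would offer a higher-level interface. The HTML snippet, the `comment` class name, and the `CommentTextParser` class are all hypothetical.

```python
from html.parser import HTMLParser


class CommentTextParser(HTMLParser):
    """Collect the text of every <p class="comment"> element."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "p" and ("class", "comment") in attrs:
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment:
            self.comments[-1] += data


# A made-up page fragment; in practice this would be downloaded HTML.
page = """
<div class="post">
  <p class="comment">Great product!</p>
  <p class="ad">Buy now</p>
  <p class="comment">Delivery was slow.</p>
</div>
"""

parser = CommentTextParser()
parser.feed(page)
print(parser.comments)
```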

<h3 style="color:navy;display:inline;">Technique: <span style="color:black"> Sensors, Mobile and IoT Device</span></h3> 

**Key Facts:**
* Collects data from sensors, EEG Signals, MRI, Mobile SMS, Hand-badges, IoT Devices.
* Data collected from these sources is often noisy and sometimes difficult to import or convert into a readable format.

**Example:** To classify eye state, EEG signal data are collected.

### Open-Source Datasets

There are lots of openly available datasets on the web. 
The two most popular data sources are:

1. **UCI Machine Learning Repository**: https://archive.ics.uci.edu/ml/index.php
2. **Kaggle**: https://www.kaggle.com/datasets

Others:
* **Government data portal (Bangladesh):** http://data.gov.bd/
* **Github:** https://github.com/awesomedata/awesome-public-datasets
* **Google Data Search:** https://toolbox.google.com/datasetsearch/

<a id='exploration'></a>
## 2. Data Exploration

**Definition:** Data exploration, also known as EDA (Exploratory Data Analysis), is the initial step in data analysis, where a data analyst uses visual exploration to understand a dataset, uncovering its initial patterns, characteristics, and points of interest.

In other words, EDA is used to understand, summarize and analyse the contents of a dataset, usually to investigate a specific question or to prepare for more advanced modeling.

A few basic data exploration approaches are discussed below:

### Missing Value Analysis (Exploration Phase)

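In real-world datasets, values are often missing. There are a few reasons why data goes missing, usually grouped into three cases:

1. **Missing Completely at Random (MCAR):** the missingness is unrelated to any value in the data.
2. **Missing at Random (MAR):** the missingness depends only on other observed variables.
3. **Missing Not at Random (MNAR):** the missingness depends on the missing value itself.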


### Treatment (Preprocessing Phase)
* In the first two cases, it is generally safe to remove the observations with missing values, depending on how often they occur.
* In the third case, removing observations with missing values can bias the model, so we have to be careful before removing them.
* Impute the missing values.
* Use models that are robust to missing values.
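
The detection and imputation steps can be sketched with pandas as follows. The table is made up, and median imputation is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# A made-up table with gaps marked as NaN.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [50000, 62000, np.nan, np.nan],
})

# Exploration: count the missing values in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Treatment: fill each gap with the median of its column.
df_imputed = df.fillna(df.median(numeric_only=True))
print(df_imputed)
```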

### Outlier Analysis

**Outlier:** An outlier is a data point that differs significantly from other observations.

An outlier can cause serious problems in training models.

#### Outlier detection (Exploration Phase)
* Among the best visualization approaches are boxplots for univariate analysis and scatterplots for bivariate analysis.
* Use the interquartile range (IQR) rule: flag points that lie more than 1.5 × IQR below the first quartile or above the third quartile.

#### Treatment (Preprocessing Phase)
* We can delete them if they are very few.
* If not, we can apply a special treatment such as trimming or imputing them.
* Use models that are robust to outliers.
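
A minimal sketch of interquartile-range detection with NumPy, assuming the common convention of flagging points more than 1.5 times the IQR beyond the quartiles. The sample values are made up.

```python
import numpy as np

# Made-up measurements with one obviously extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 95])

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, not a law).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```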

### Class Imbalance Analysis

**Definition:** Data are said to suffer from the class imbalance problem when the class distributions are highly imbalanced. In this context, classification algorithms tend to have low predictive accuracy for the infrequent class.

#### Class Imbalance detection (Exploration Phase)
* Use Bar Plot or Pie Chart to visualize the distribution of the classes (target variable).
* Simply make a frequency table for the classes (target variable).

#### Treatment (Preprocessing Phase)
* Apply sampling methods like Oversampling or Undersampling.
* Apply class weighting while training models.
* Use a voting or ensemble strategy.
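
The frequency table and a naive random oversampling can be sketched with pandas as below. The data and the `ok`/`fraud` labels are hypothetical, and duplicating minority rows is only the simplest form of oversampling.

```python
import pandas as pd

# Made-up dataset: 8 majority-class rows and 2 minority-class rows.
df = pd.DataFrame({
    "feature": range(10),
    "label": ["ok"] * 8 + ["fraud"] * 2,
})

# Exploration: frequency table of the target variable.
class_counts = df["label"].value_counts()
print(class_counts)

# Treatment: resample minority rows (with replacement) until classes match.
majority_size = int(class_counts.max())
minority = df[df["label"] == "fraud"]
extra = minority.sample(n=majority_size - len(minority),
                        replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)
print(balanced["label"].value_counts())
```

In practice, resampling should be applied to the training split only, so that duplicated rows do not leak into the test data.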

<a id='preprocessing'></a>
## 3. Data Preprocessing

**Definition:** Data preprocessing is the phase that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

Data preprocessing is the first (and arguably most important) step toward building a working machine learning model. It's critical! If your data hasn't been cleaned and preprocessed, your model won't work. It's that simple.

Data preprocessing is generally thought of as the boring part. However, it makes the difference between the best performing model and the other models.

Here are a few basic tasks involved in the data preprocessing phase.

### Cleaning Noise
Real-world data often contains extra noise. 

**For example:**
* Online survey data may contain special characters.
* Missing cells may contain special characters rather than a space or an empty value.
* Web-scraped data often includes unnecessary HTML tags.
* Image, voice and signal data are often noisy and unclear.

Such noise should be removed first.
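
As a rough sketch of this kind of cleanup for text data, the function below strips HTML tags, decodes HTML entities, collapses whitespace, and maps placeholder strings to `None`. The sample rows and the set of placeholder markers are assumptions made for the example.

```python
import html
import re

# Made-up raw survey/scraped values.
raw_rows = ["<p>Good   service!</p>", "N/A", "<br>Too slow &amp; pricey", "??", ""]

# Strings treated as "missing" here; this set is an assumption.
MISSING_MARKERS = {"", "N/A", "??", "-"}


def clean(text):
    """Return cleaned text, or None if the cell is effectively empty."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = html.unescape(text)                 # &amp; -> &
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return None if text in MISSING_MARKERS else text


cleaned = [clean(row) for row in raw_rows]
print(cleaned)
```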

### Missing Value Treatment
Missing value treatment is one of the most common tasks in the data preprocessing phase. The detection of missing values is performed in the data exploration phase. After identifying the probable reason behind the missingness, we should choose a missing value treatment strategy and apply it to the missing data in this phase. 

### Outlier Treatment

Real-world data often contains outliers. Treating outliers is crucial for training a good model; otherwise they may harm the model's ability to learn. After detecting the outliers in the exploration phase, we need to choose a strategy for handling the extreme values and apply it in this phase.

### Imbalanced Class Treatment

This is another problematic scenario in classification tasks. Imbalanced target classes make your model biased: the model tends to learn mostly from the majority class and cannot learn the patterns of the minority class. The class imbalance and its ratio should be identified in the exploration phase. In this preprocessing phase, we should apply an imbalanced class handling strategy to balance the learning of the model.

### Feature Encoding

Feature encoding simply means converting categorical features into numerical ones. Most machine learning algorithms cannot work directly with character/text values, so we must represent the categorical features in our dataset in numerical form before feeding the data to the model. The two most commonly used feature encoding methods are described below.

#### Label Encoding
Label Encoding is simply encoding/replacing the categorical value with a number that is unique for that particular class.

Label encoding in Python can be done with scikit-learn or pandas. Scikit-learn provides an efficient tool for encoding the levels of categorical features into numeric values: **LabelEncoder** encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it is assigned the same value as before.

**For Example:**
<img src="imgs/label_encoding.png" alt="drawing" width="500">

But depending on the data, label encoding introduces a new problem. For example, suppose we have encoded a set of country names as numbers. The data is still categorical, and there is no ordering relation of any kind between the categories.

The problem is that, since the column now contains different numbers, the model may misinterpret the data as being in some kind of order, 0 < 1 < 2.

The model may derive a correlation such as "as the country number increases, the population increases", which clearly may not hold in other data or in the prediction set. To overcome this problem, we use One Hot Encoding.
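
A minimal sketch of label encoding with scikit-learn's `LabelEncoder`; the country list is made up. Note that the integer codes follow the sorted order of the labels, which is exactly the artificial ordering discussed above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column.
countries = ["Japan", "France", "Japan", "Brazil"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted distinct labels; a label's index is its code.
print(list(encoder.classes_))
print(codes)

# The mapping is reversible.
print(encoder.inverse_transform(codes))
```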


#### One Hot Encoding

One hot encoding takes a column which has categorical data and then splits the column into multiple columns. The numbers are replaced by 1s and 0s, depending on which column has what value.

**For Example:**
<img src="imgs/one_hot_encoding.png" alt="drawing" width="600">

One hot encoding also has a drawback: it increases the dimensionality of the dataset, since each distinct category becomes a new column.
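
One way to sketch one hot encoding is with `pandas.get_dummies`; the country values are made up. Each distinct category becomes its own 0/1 column, so every row contains exactly one 1.

```python
import pandas as pd

# Made-up categorical column.
df = pd.DataFrame({"country": ["Japan", "France", "Japan", "Brazil"]})

# One new 0/1 column per distinct category.
encoded = pd.get_dummies(df["country"], prefix="country", dtype=int)
print(encoded)
```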

#### When to use Label Encoding and One Hot Encoding

* Use Label Encoding for tree-based algorithms.
* Use One Hot Encoding for distance-based algorithms.

### Data Scaling 
Data Scaling aka Feature scaling is a method used to standardize the range of independent variables or features of data.


#### Why scale the features?
Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.

The three most popular feature scaling methods are described below:

#### Rescaling
It is commonly known as min-max scaling. It is the simplest method and consists of rescaling each feature to a fixed range, typically [0, 1] or [−1, 1]. The choice of target range depends on the nature of the data. The general formula for the range [0, 1] is given as:

${\displaystyle x'={\frac {x-{\text{min}}(x)}{{\text{max}}(x)-{\text{min}}(x)}}}$

where ${\displaystyle x}$ is an original value, ${\displaystyle x'}$ is the rescaled value. 
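
The formula can be applied directly with NumPy. A minimal sketch on a made-up feature vector:

```python
import numpy as np

# Made-up feature values.
x = np.array([10.0, 20.0, 25.0, 50.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```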


#### Mean Normalization
Mean normalization centers each feature on its mean and divides by its range, so the rescaled values have a mean of 0 and lie within [−1, 1]. Despite the name, it does not turn the observations into a normal distribution; it only shifts and rescales them.

For reference, the normal distribution (Gaussian distribution), also known as the bell curve, is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean.

The general formula is given as:

${\displaystyle x'={\frac {x-{\text{mean}}(x)}{{\text{max}}(x)-{\text{min}}(x)}}}$

where ${\displaystyle x}$ is an original value, ${\displaystyle x'}$ is the normalized value. 
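
A minimal NumPy sketch of the mean normalization formula on a made-up feature vector. The result has a mean of zero and spans a total range of one.

```python
import numpy as np

# Made-up feature values.
x = np.array([10.0, 20.0, 25.0, 50.0])

# Subtract the mean, then divide by the range (max - min).
x_norm = (x - x.mean()) / (x.max() - x.min())
print(x_norm)
print(x_norm.mean())
```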


#### Standardization
Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1.

The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.

${\displaystyle x'={\frac {x-{\bar {x}}}{\sigma }}}$

Where $x$ is the original feature vector, ${\bar{x}={\text{average}}(x)}$ is the mean of that feature vector, and $\sigma$ is its standard deviation.
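
The sketch below applies the formula by hand with NumPy and compares it with scikit-learn's `StandardScaler`, which uses the population standard deviation, the same default as `np.std`. The two-feature matrix is made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: four observations of two features.
X = np.array([[10.0, 1.0],
              [20.0, 3.0],
              [25.0, 5.0],
              [50.0, 7.0]])

# Manual standardization, column by column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation using scikit-learn.
scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))  # approximately 0 for each feature
print(scaled.std(axis=0))   # approximately 1 for each feature
```

When a separate test set exists, the scaler is normally fitted on the training data only and then reused to transform the test data.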

### Feature Engineering

**Definition:** Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms perform the best.

Feature engineering is the key to building the best-performing models.

**For Example:** Suppose you have some item data with **item_id, weight,** and **price**. You can derive a new feature named **price_per_weight** from the **weight** and **price** columns.

<img src="imgs/feature_engineering.png" alt="drawing" width="600">
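
The derived feature from this example can be sketched in pandas as follows; the item values are made up.

```python
import pandas as pd

# Made-up item data with the three columns from the example.
items = pd.DataFrame({
    "item_id": [1, 2, 3],
    "weight": [2.0, 0.5, 4.0],
    "price": [10.0, 4.0, 12.0],
})

# New feature: price divided by weight, computed row by row.
items["price_per_weight"] = items["price"] / items["weight"]
print(items)
```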

**We'll learn the feature engineering techniques in detail in future chapters. Probably on Day 09.**

<a id='modeling'></a>
## 4. Data Modeling

**Definition:** Data Modeling is the process of applying appropriate Machine Learning algorithms to fit or model the data. It is basically the process of training Machine Learning algorithms to learn the underlying patterns from the data.

The data modeling phase involves two crucial decisions.


### Choose the right algorithm for your data
There should be a particular reason behind selecting a particular algorithm for a problem. As a starting point, you can use scikit-learn's algorithm cheat sheet to choose an algorithm to model your data.

<img src="imgs/sklearn_algo_selection.png" alt="drawing" width="800">

<div style="text-align: right"> 
    Source: <a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html">scikit-learn</a>
</div>


### Tune the best hyperparameters for your algorithm

**Hyperparameter Tuning:** Hyperparameter tuning is choosing a set of optimal hyperparameters for a learning algorithm. 

So what is a **hyperparameter**?

**A hyperparameter** is a parameter whose value is set before the learning process begins.

**For Example:** Choosing the value of $k$ for the k-nearest neighbors algorithm.
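
One common way to choose $k$ is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on the built-in Iris dataset; the dataset and the candidate values of $k$ are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of k; this grid is an arbitrary choice for illustration.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Evaluate every candidate with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```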

<a id='validation'></a>
## 5. Model Validation

**Definition:** Model Validation is the phase where we use the trained model to make predictions on the unseen data. Then we use some evaluation metrics to evaluate the model performance and spot overfitting or underfitting.

This phase involves 3 tasks:

### Making predictions
* Use the trained model to make predictions on the **test data** (unseen data).

### Choose proper evaluation metrics and evaluate model performance
* According to the type of problem, we need to choose a few evaluation metrics.
* For Regression: MSE, MAE, MAPE, $R^2$ etc.
* For Classification: Accuracy, Precision, Recall, F1-Score etc.
* For Clustering: Mutual Information Score, Homogeneity Score etc.
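
For classification, the listed metrics are available in `sklearn.metrics`. A minimal sketch on made-up binary labels, where the predictions contain 3 true positives, 3 true negatives, 1 false positive, and 1 false negative:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(accuracy, precision, recall, f1)
```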

### Spot overfitting/underfitting
Comparing the training error with the validation error gives a simple way of spotting overfitting/underfitting. There are four cases: 
1. **Underfitting:** Validation and training errors are both high. 
2. **Overfitting:** Validation error is high, training error is low. 
3. **Good fit:** Validation error is low, slightly higher than the training error.
4. **Unknown fit:** Validation error is low, training error is high. I call this an 'unknown' fit because the result is counterintuitive given how machine learning works.
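
The comparison can be sketched by training decision trees of different depths and printing both errors. The synthetic dataset, the depths, and the 70/30 split are assumptions made for the example; a very shallow tree will typically lean toward underfitting, and an unrestricted tree toward overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (10% of labels flipped).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

errors = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)
    val_err = 1 - model.score(X_val, y_val)
    errors[depth] = (train_err, val_err)
    print(f"max_depth={depth}: train error={train_err:.2f}, "
          f"validation error={val_err:.2f}")
```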

<a id='communication'></a>
## 6. Communication

**Definition:** In data science, communication involves interpreting the model, explaining why a model made a particular decision, discovering hidden patterns, writing proper documentation, and deploying the model to production.

#### Interpret your model
* Model explainability is crucial. We need to dissect the logic behind a model's predictions.
* We may need to explain how the model works to non-technical people.

#### Discover the hidden patterns
* Visualize the tree models. Extract the rulesets.
* Discover the important features.
* Extract the important feature coefficients.
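
For tree models, the rule extraction and feature importance steps can be sketched with scikit-learn. The Iris dataset and the depth limit of 2 are assumptions made to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the extracted rule set small and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the learned decision rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Relative importance of each feature (the values sum to 1).
importances = dict(zip(iris.feature_names, tree.feature_importances_))
print(importances)
```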

#### Make documentation
* Documenting each step is very important.
* It makes our work easy-to-share, readable and understandable.

#### Deploy your model to production
* Choose the right platform for deploying the model.
* Keep scalability in mind.
* Choose fast and scalable APIs.
* Monitor and A/B test your models.

### Interesting statistics on how a data scientist's time is spent across the stages of the data science pipeline

<img src="imgs/time-doing.jpg" alt="drawing" width="500">

<h5 style="color:red;display:inline;">We will strictly follow this pipeline in the upcoming practice and capstone projects.</h5>