# Data Science

## 1. Understanding the business case

### 1.1 Understanding the business domain

- ___understand the data dictionary___
- check the work of other Data Scientists
- talk to stakeholders
- search
  - [Google](https://www.google.com/)
  - [Google Scholar](https://scholar.google.com/)
- Company website
- Competitor websites
- Company intranet
- [arxiv](https://arxiv.org/)


### 1.2 Asking business relevant questions / Problem statement

- Regression?
- Classification?
- Both?

### 1.3 Defining a metric of success

## 2. Data Mining

Loading data into a Python runtime using:

- [Pandas](https://pandas.pydata.org/)
- [Apache Spark](https://spark.apache.org/)
- [Apache Arrow](https://arrow.apache.org/) for compatibility between different runtimes.

From different data sources:
- CSV
- JSON
- [Parquet](https://parquet.apache.org/)
- [Avro](https://avro.apache.org/)
- AWS S3
- HTML (via Web Scraping)
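
#### Example: Loading CSV and JSON with Pandas

A minimal sketch of loading two of the formats listed above with Pandas. The in-memory strings `csv_text` and `json_text` are hypothetical stand-ins for real files; in practice a file path would be passed instead.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "id,age,country\n1,25,UK\n2,31,USA\n"
json_text = '[{"id": 1, "age": 25}, {"id": 2, "age": 31}]'

# pandas.read_csv and pandas.read_json accept a path or a file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape)
print(json_df.dtypes)
```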

#### Example: Loading Parquet with Apache Spark/PySpark

Loading `data/drug_consumption.parquet` into a Spark DataFrame through a `pyspark.sql.SparkSession` instance and querying it with SQL.

```python
from pyspark.sql import SparkSession

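name = 'drug_consumption'
# Create a Spark session, or reuse the one that is already running.
spark = SparkSession.builder.appName(name).getOrCreate()

# Read the Parquet file into a Spark DataFrame and register it as a
# temporary view so that it can be queried with SQL.
df = spark.read.parquet(f"data/{name}.parquet")
df.createOrReplaceTempView(name)

query_result = spark.sql(f"SELECT count(id) FROM {name}")
query_result.show()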
```

#### Example: Loading CSV and Parquet with Apache Arrow

Utility functions which load a CSV file or a Parquet file into a `pandas.DataFrame`.

```python
import pandas
import pyarrow.csv as csv
import pyarrow.parquet as pq


def read_csv(path: str) -> pandas.DataFrame:
    """
    Loading a CSV file from the given path.
    
    :param path: The path to a CSV file.
    :returns: A DataFrame containing the CSV data.
    :rtype: pandas.DataFrame
    """

    table = csv.read_csv(path)
    return table.to_pandas()


def read_parquet(path: str) -> pandas.DataFrame:
    """
    Loading a Parquet file from the given path.
    
    :param path: The path to a Parquet file.
    :returns: A DataFrame containing the Parquet data.
    :rtype: pandas.DataFrame
    """

    table = pq.read_table(path)
    return table.to_pandas()
```

## 3. Data Cleaning

- read and understand data dictionary
- understanding features of the dataset
  - column names/predictors/independent variables
  - dependent/predicted variable
- understanding the rows/observations of the dataset
- shape of the dataset
  - too small
    - get more data
    - synthetic data generation
    - search for additional sources
  - too large
    - do Principal Component Analysis (part of _Feature Engineering_) when too many features exist
    - take random samples
- dealing with duplicates
  - remove duplicates
  - for anomaly detection, duplicates would be kept if they help to identify patterns over time
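
#### Example: Inspecting Shape, Duplicates and Samples with Pandas

The checks above can be sketched with Pandas as follows. The DataFrame `df` and its columns are hypothetical; whether duplicates should really be dropped depends on the use case, as noted for anomaly detection.

```python
import pandas as pd

# Hypothetical toy dataset; the second and third rows are exact duplicates.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "score": [0.5, 0.7, 0.7, 0.1, 0.9],
})

# Shape of the dataset as (rows, columns).
print(df.shape)

# Number of rows that repeat an earlier row.
n_duplicates = df.duplicated().sum()

# Drop the repeated rows and renumber the index.
deduplicated = df.drop_duplicates().reset_index(drop=True)

# A reproducible random sample, useful when the dataset is too large.
sample = deduplicated.sample(frac=0.5, random_state=42)
```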

### 3.1 Data Manipulation

- map data types
  - numerical
  - categorical
  - same scale (time, currency)
- finding missing values
- finding duplicates
- setting dummy variables
- type conversions
- renaming columns
- dropping obviously redundant or obsolete columns
  - there might be good reasons not to delete columns
- handle missing data
  - impute: e.g. take the `mean`, the `median` or some other value
  - delete the feature or observation
  - enter a dummy for missing values, e.g. for a categorical feature add an extra dummy column for missing values, similar to one-hot encoding, where we put a `1` when a value is present and a `0` otherwise
- create _dummy_ variables for categorical values
- ensure that categorical variables of the same feature are encoded consistently

#### Example: Mapping Data Types with Pandas

```python
import pandas as pd


def column_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Converting all values of a column to datetime values.

    :param df: The source pandas.DataFrame.
    :param column_name: The name of the column to transform.
    :returns: The pandas.DataFrame passed as argument, modified in place.
    :rtype: pandas.DataFrame
    """
    # pd.to_datetime converts the whole column in one call.
    df[column_name] = pd.to_datetime(df[column_name])
    return df
```
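
#### Example: Handling Missing Values and Dummy Variables with Pandas

A minimal sketch of the imputation and dummy-variable steps listed in this section. The DataFrame and its `age` and `country` columns are hypothetical, and the median is only one possible imputation choice. Note that `dummy_na=True` adds an indicator column that marks the rows where the value was missing.

```python
import pandas as pd

# Hypothetical toy dataset with one numerical and one categorical gap.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "country": ["UK", "USA", None, "UK"],
})

# Count the missing values per column.
missing_per_column = df.isna().sum()

# Impute the numerical column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the categorical column; dummy_na=True adds an extra
# indicator column for rows where the category was missing.
encoded = pd.get_dummies(df, columns=["country"], dummy_na=True)
print(encoded)
```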

## 4. Data Exploration

- summary statistics:
  - [`pandas.DataFrame.describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
    - distribution of numerical features
- using visualization for exploratory data analysis
  - libraries/packages
    - [matplotlib](https://matplotlib.org/)
    - [seaborn](https://seaborn.pydata.org/)
    - [bokeh](https://docs.bokeh.org/)
    - [plotly](https://plot.ly/)
- using SQL for distinct queries on the data to get a more granular look at the data
- check for outliers in the plotted distributions
- check for normality (Normal distribution)
- advanced plots to gain more insights on the data
- heatmap for visualizations of correlations
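
#### Example: Summary Statistics, Outliers and Correlations with Pandas

The exploration steps above can be sketched without any plotting library. The DataFrame is hypothetical, and the 1.5 × IQR rule used here is just one common rule of thumb for flagging outliers, not the only option.

```python
import pandas as pd

# Hypothetical toy dataset; 120 in "age" is a deliberate outlier.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, 120],
    "income": [20, 24, 27, 31, 30, 35, 36, 40],
})

# Summary statistics of the numerical features.
print(df.describe())

# Flag outliers with the interquartile range (IQR) rule of thumb.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Skewness far from 0 is a rough hint that a feature is not normally distributed.
skewness = df["age"].skew()

# Pairwise Pearson correlations; this matrix is what a heatmap would visualize.
correlations = df.corr()
```

Passing `correlations` to `seaborn.heatmap` would render the matrix as a heatmap.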

## 5. Feature Engineering

- ___dimensionality reduction, Principal Component Analysis (PCA)___
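- transforming/scaling skewed, continuous variables
  - MinMaxScaler: rescales each feature to a fixed range, by default [0, 1]
  - StandardScaler: rescales each feature to zero mean and unit variance
  - both only rescale; making a skewed variable more normally distributed needs an additional transformation, e.g. a log transform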
- extract new features from existing ones, e.g. extract `gender` from a `name` feature
- regrouping old categories to create new categories
- dummy variables for categories we want to use
- impute values for features (this can be a part of data cleaning as well as feature engineering)
- assign weights to features denoting feature importance
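
#### Example: Scaling and PCA with scikit-learn

A minimal sketch of the scaling and dimensionality reduction steps, assuming scikit-learn is the library behind `MinMaxScaler` and `StandardScaler`. The feature matrix `X` is randomly generated and purely hypothetical, and the choice of 2 principal components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: 100 observations, 5 numerical features.
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 5))

# StandardScaler: zero mean and unit variance per feature.
X_standard = StandardScaler().fit_transform(X)

# MinMaxScaler: each feature rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# PCA: project the standardized features onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_standard)

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

When a model is trained afterwards, the scalers and PCA would normally be fitted on the training data only and then applied to the test data.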

## 6. Predictive Modelling

- choose the models to use that are appropriate for the current challenge
  - resources identified during Business Understanding
  - Kaggle winners often use ___Stacking___ to get the best performance/evaluation, but in the real world this is usually not the case
- preparation for predictive modelling
- train/test split (90/10, 70/30, 80/20 ...)
- balance the dataset
- what is the model baseline performance metric?
  - a naive model should perform better than a random guess
- run the model
  - applying PCA
  - GridSearch
  - KFold cross validation
- evaluate model performance
  - confusion matrix
  - classification report
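
#### Example: Baseline, Grid Search and Evaluation with scikit-learn

The modelling steps above can be sketched with scikit-learn as follows. Everything specific here is an assumption made for illustration: the synthetic dataset, the choice of logistic regression as the model, the 80/20 split and the parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Hypothetical synthetic binary classification data.
X, y = make_classification(
    n_samples=500, n_features=10, n_clusters_per_class=1, random_state=0
)

# 80/20 train/test split, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class of the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Grid search over the regularization strength with 5-fold cross validation.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
model_accuracy = search.score(X_test, y_test)

# Evaluate the best model found by the grid search on the held-out test set.
y_pred = search.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
print(classification_report(y_test, y_pred))
```

Comparing `model_accuracy` against `baseline_accuracy` shows whether the tuned model actually beats the naive baseline.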

## 7. Data Visualization