# **Unit 03 - General Concepts**

## **3.1 Typical Flow**

Below is the typical flow when you are trying to solve a problem. The sequence is not fixed, and in practice you almost always iterate back and forth between steps several times.

1. **Research/Requirement specification:**
    - Know your domain. Do market research to explore how predictive analytics could benefit end users and what revenue it could generate.
    - A stakeholder may have a specific problem that needs solving.
    - You may simply be given a dump of data that your stakeholder thinks can provide value. Look at the big picture: understand what the stakeholder does and how the data is generated, then identify gaps that can be filled or efficiencies that can be gained.


2. **Formalize your Objective:** At this stage you must be clear about what it is that you want to solve. It helps to make a custom checklist. Ask yourself every question you can think of.
    - What is the goal of the project? What are we trying to solve?
    - Who will benefit from this?
    - What is the current solution to the problem you are solving? How good/bad is it?
    - Do you have data? How can you collect it? Where is it stored?
    - Does your data have labels? (Supervised/Unsupervised/Semi-supervised) If not, can you or somebody else help with labelling?
    - Is it a classification task? Regression task? Something else?
    - Does it require batch learning or online learning?
    - What is the measure of performance? What would be deemed acceptable by the stakeholder?
    - What are the assumptions made?
    - Many more based on your problem ...
    
    
3. **Get your data**: 
    - Your stakeholder may provide you the data - files or a database.
    - You may have to gather it yourself. Explore different sources. Conduct surveys.
    - Or both, if neither source is sufficient on its own.
    - Your models will be built on what we call the `training set`, typically 70-80% of the data for most datasets. The remaining data is called the `test set`. We do not touch it until the end, so that it gives an honest estimate of how well the model generalizes to new, unseen data.
    
    
4. **Exploratory Analysis**: 
    - Use only the `training set`. Do not look at the `test set`. This is to avoid what we call `snooping bias`.
    - Understand your data. 
    - Know the statistics and data distributions. 
    - Visualize. Make inferences on what the data is telling you. 
    - Run sanity checks. Check the quality of the data.
    - Identify missing data and think about the best strategies for dealing with it.
    - Revisit and fine-tune your problem formulation based on what you infer.
    - Preprocess the data so that it can be used for modelling.
    
    
5. **Training a model**: 
    - Based on your understanding of the data, pick potentially suitable models and train them.
    - Fine-tune the models to get the best results.
    - Analyze and interpret the results of your model. Evaluate the performance on the `test set`. 
    
    
6. **Deploy your solution**: This is case dependent. 
    - Your end goal could be a report to show business executives what decisions they should take. 
    - It could be an application that benefits the end user.
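
Steps 3 to 5 above can be sketched end to end. The following is a minimal illustration using only the Python standard library, not a definitive implementation. The toy data, the 80/20 split, the helper names (`split_train_test`, `fit_line`, `mean_absolute_error`) and the choice of a one-feature least-squares line are all assumptions made for this example.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(pairs, slope, intercept):
    """Average absolute difference between targets and predictions."""
    return sum(abs(y - (slope * x + intercept)) for x, y in pairs) / len(pairs)


# Hypothetical toy data: (feature, target) pairs following y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(100)]

# Step 3: hold out a test set before doing anything else.
train, test = split_train_test(data, test_fraction=0.2)

# Steps 4 and 5: explore and train using the training set only.
slope, intercept = fit_line(train)

# Final step: evaluate once on the untouched test set.
test_mae = mean_absolute_error(test, slope, intercept)
print(f"train={len(train)} test={len(test)} test MAE={test_mae:.6f}")
```

The key design choice is that `fit_line` never sees `test`; the test set is used only for the final evaluation.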

## **3.2 Exploratory Data Analysis**

It is important to understand the data we work with. We should know it better than the model we try to build with it, so that we can understand why the model predicts something a certain way.

Data can be -
- **Structured**: It is in tabular form with `m` rows and `n` columns. The rows are called `examples`. The columns are called `features`.
- **Unstructured**: There is no defined structure. You cannot directly apply an algorithm without some preprocessing. E.g. text documents, images.
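
As a small illustration of the structured case, the sketch below represents a table as a list of dictionaries and reads off `m` and `n`. The column names and values are hypothetical, chosen only for this example.

```python
# Hypothetical structured dataset: each dict is one example (row),
# each key is one feature (column).
examples = [
    {"age": 34, "city": "Pune", "income": 52000},
    {"age": 29, "city": "Delhi", "income": 61000},
    {"age": 41, "city": "Delhi", "income": 58000},
]

features = list(examples[0].keys())
m = len(examples)   # number of rows (examples)
n = len(features)   # number of columns (features)

print(f"{m} examples x {n} features: {features}")
```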

#### **Objectives of an EDA**
- Identify the types of data. 
- Obtain descriptive statistics. 
- Identify quirks or inconsistencies in the data.
- Know the distributions of the data.
- Assess the predictive power of each feature, i.e. whether the features are correlated with the target variable.
- Most importantly, to understand what you are trying to solve and whether you can explain it with your data.
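
A few of these objectives can be checked with very little code. The sketch below, assuming a hypothetical table stored as a list of dictionaries with `None` marking a missing value, counts missing values per feature, lists the Python types seen in each column, and surfaces an inconsistency in a categorical column.

```python
# Hypothetical raw data with typical quirks: missing values (None)
# and inconsistent spelling of the same category ("Pune" vs "pune").
rows = [
    {"age": 34, "city": "Pune", "income": 52000.0},
    {"age": None, "city": "Delhi", "income": 61000.0},
    {"age": 29, "city": "pune", "income": None},
    {"age": 41, "city": "Delhi", "income": 58000.0},
]
columns = list(rows[0].keys())

# Missing values per feature.
missing = {col: sum(1 for r in rows if r[col] is None) for col in columns}

# Python types observed in each feature, ignoring missing values.
types = {
    col: sorted({type(r[col]).__name__ for r in rows if r[col] is not None})
    for col in columns
}

# Distinct values of a categorical feature help reveal inconsistencies.
distinct_cities = sorted({r["city"] for r in rows})

print("missing:", missing)
print("types:", types)
print("distinct cities:", distinct_cities)
```

Here the distinct values show that "Pune" and "pune" are being treated as different categories, which is the kind of quirk worth fixing before modelling.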

#### Types of data
- Categorical data:
    - Nominal
    - Ordinal
- Continuous data:
    - Interval
    - Ratio

### 3.2.1 Descriptive / Univariate Analysis

Univariate analysis looks at one variable at a time:
- Central Tendency
    - Mean
    - Median
    - Mode
- Variability
    - Range
    - Variance
    - Interquartile range

- Categorical Variables
    - Bar chart
- Continuous Variables
    - Density chart (a bar chart is not a good fit for continuous values)
    - Box plot
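
The summary statistics listed above can be computed with Python's built-in `statistics` module. This is a minimal sketch on a hypothetical sample. It assumes the sample variance (dividing by `n - 1`) and the module's default quartile method, so other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical sample of a single continuous variable.
values = [2, 4, 4, 5, 7, 9, 11]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Variability
value_range = max(values) - min(values)
variance = statistics.variance(values)          # sample variance (n - 1)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                                   # interquartile range

print(f"mean={mean} median={median} mode={mode}")
print(f"range={value_range} variance={variance} IQR={iqr}")
```

The quartiles `q1`, `q2` and `q3` are also the values a box plot draws as the box edges and the middle line.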
    
### 3.2.2 Correlation Analysis
- Categorical and Categorical
- Continuous and Continuous
- Categorical and Continuous

Outcome - 
- Determine which features have predictive power
- Determine redundant features
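
Each of the three pairings above has a simple descriptive starting point. The sketch below uses only the standard library and hypothetical data: the Pearson coefficient for two continuous variables, a contingency table of counts for two categorical variables, and per-category means for a categorical and a continuous variable. Formal tests (for example chi-square or ANOVA) are not shown here.

```python
import math
from collections import Counter, defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical features; `price` is exactly 3 * `area` for illustration.
area = [50, 60, 80, 100, 120]
price = [150, 180, 240, 300, 360]
city = ["A", "A", "B", "B", "B"]
sold = ["yes", "no", "yes", "yes", "no"]

# Continuous and continuous: linear correlation in [-1, 1].
r = pearson(area, price)

# Categorical and categorical: counts for each pair of categories.
contingency = Counter(zip(city, sold))

# Categorical and continuous: mean of the continuous value per category.
groups = defaultdict(list)
for c, p in zip(city, price):
    groups[c].append(p)
group_means = {c: sum(v) / len(v) for c, v in groups.items()}

print(f"pearson(area, price) = {r:.3f}")
print("contingency:", dict(contingency))
print("mean price per city:", group_means)
```

A coefficient close to 1 or -1 between two features suggests one of them may be redundant, while a strong relationship between a feature and the target suggests predictive power.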



## 3.3 Model Training

### Model Evaluation

### Bias and Variance


### Overfitting and underfitting

### Regularization

