# Introduction to Python Session 8

## The Data Science Workflow 

Our last session wrapped up Chapter 4 of the Social Network Analysis book. Today's session starts something new: the modeling workflow. Today's first lesson will be more conceptual, since we will cover some of the more common terms used in modeling. The data science workflow can be broken down into the steps listed below. We are going to go into detail about what each of these steps involves, and we will cover the first three today.

In terms of material, we will be using the "Data Science for Business" book. It is a popular book that is used in many data science programs. I used it myself for my Data Management class. 

* Problem Identification 
* Data Gathering 
* Data Cleaning 
* Exploration 
* Feature Engineering 
* Normalizations and Transformations 
* Model
* Validate 
* Deploy/Interpret 

## Problem Identification 

This is typically considered the most difficult step of the data science workflow. A well-defined problem statement can make or break the entire project. This step determines whether you end up with a "garbage in, garbage out" model or one that delivers real insight. Here are some guidelines for identifying the problem.

 - What are you trying to solve? 
 
 Are you trying to predict an outcome? Are you trying to classify data into groups? Are you trying to measure performance? Are you trying to capture the relationship between two entities? Are you trying to model some segment? There are more questions we could come up with, but this sort of brainstorming is crucial. Once you have an idea of what you want to solve, the next step is to source your data.
 
## Data Gathering 

Sourcing your data is where domain knowledge comes into play. You want to gather data from one or more sources that best capture the variables related to the problem you are trying to solve. This step will most likely have you interacting with a database and using some variation of SQL. It is uncommon to find a CSV file or Excel workbook that already has clean data and variables, so it will be your job to create this clean data and find the right variables. During the data gathering phase, you should develop an intuition for what kind of data structure you want to work with. Data structure here includes explicitly defined rows and columns, data format (JSON, TXT, CSV, etc.), and storage. By defined rows, I mean knowing what each row of your data should represent. Does each row represent a user, a timestamp, or an experiment? Must rows be unique, or can they be duplicated? If rows are duplicated, did you implement some sort of nesting for the other variables in your data frame?
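As a minimal sketch of the "what does each row represent" question, the check below assumes a pandas data frame where each row is supposed to be one user. The `users` table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each row is meant to represent one user.
users = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "signup_date": ["2020-01-01", "2020-01-05", "2020-01-05", "2020-02-10"],
})

# Count rows whose user_id repeats an earlier row.
n_duplicates = int(users.duplicated(subset="user_id").sum())

# True only if every row has a distinct user_id.
is_unique = users["user_id"].is_unique

print(n_duplicates, is_unique)
```

If the duplicate count is not zero, either the rows do not mean what you assumed or the data needs to be deduplicated or nested before moving on.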

## Data Cleaning 

This is where 70-80 percent of your time will be spent. Real data is never clean and requires some engineering to make it usable. The overall goal in data cleaning is to maximize the accuracy of a dataset without throwing away information. Most analysts start by examining the distribution of missing data by column. How to treat missing data is debated frequently in data science forums. Many people argue that a variable should be deleted if it is missing more than 30 percent of its values, while others counter that a variable can be too valuable to simply delete from the data. This is when good judgement and creativity shine through. What are some ways we can deal with this?
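The missing-data check described above could be sketched in pandas roughly as follows. The data frame is invented for the example, and the 30 percent threshold is only the rule of thumb mentioned in the text, not a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
    "city": ["NYC", "LA", "NYC", "SF", "LA"],
})

# Share of missing values in each column (0.0 to 1.0).
missing_share = df.isna().mean()

# Columns above the rule-of-thumb threshold are candidates for review,
# not automatic deletion.
threshold = 0.30
candidates = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(candidates)
```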

Imputation is a common method that allows us to fill in missing data. If a variable is continuous, missing data can be filled in using the median or mean of the non-missing values. The choice between mean and median can be made after you get a sense of the skewness in your data. A general rule of thumb is that symmetrically distributed values can be imputed with the mean; otherwise, use the median. This is where knowledge of statistics comes in handy for handling different distributions.
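One way to turn the mean-versus-median rule of thumb into code is sketched below. The helper `impute_numeric` and its skewness cutoff of 1.0 are assumptions made for this example; the right cutoff depends on your data.

```python
import numpy as np
import pandas as pd

def impute_numeric(series, skew_cutoff=1.0):
    """Fill missing values with the mean if roughly symmetric, else the median.

    skew_cutoff is an illustrative assumption, not a standard value.
    """
    skew = series.skew()  # sample skewness of the non-missing values
    if abs(skew) < skew_cutoff:
        fill_value = series.mean()
    else:
        fill_value = series.median()
    return series.fillna(fill_value)

# Roughly symmetric values: the gap is filled with the mean.
symmetric = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, np.nan])
filled_symmetric = impute_numeric(symmetric)

# One large outlier makes the data skewed: the gap is filled with the median.
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 100.0, np.nan])
filled_skewed = impute_numeric(skewed)

print(filled_symmetric.iloc[-1], filled_skewed.iloc[-1])
```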

Imputation can also be done on categorical values. Using the mode of a categorical column is an easy approach if the share of missing data is low. However, if we are missing half our values and fill them with the mode of the non-missing values, we will end up polluting the variable, so there needs to be a balance when imputing categorical data. Sometimes you will find that you have to impute categorical values manually. In that case, lean on your domain knowledge to decide whether it is even worth it.
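A hedged sketch of mode imputation with a guard against heavy missingness follows. The function name `impute_mode_if_low` and the 20 percent limit are assumptions for illustration; "low" has no agreed definition.

```python
import pandas as pd

def impute_mode_if_low(series, max_missing=0.20):
    """Fill missing categories with the mode, but only when few are missing.

    max_missing is an illustrative assumption.
    """
    missing_share = series.isna().mean()
    if missing_share > max_missing:
        # Too much is missing: leave the column alone for manual review.
        return series
    most_common = series.mode().iloc[0]  # mode() ignores missing values
    return series.fillna(most_common)

# 1 of 10 values missing: filled with the most common value.
colors = pd.Series(["red"] * 6 + ["blue"] * 3 + [None])
filled = impute_mode_if_low(colors)

# Half the values missing: returned unchanged.
sparse = pd.Series(["red", None, "blue", None])
untouched = impute_mode_if_low(sparse)

print(filled.iloc[-1], int(untouched.isna().sum()))
```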

Another element of data cleaning is changing the names of columns. Sometimes we get data with column names such as X1, X2, or Variable_1, Variable_2, etc. There will also be cases where you need to inspect the actual data values themselves. For example, let's say we have a variable called "budget" but notice that some of its values are negative. Does this make sense? Let's say we have another variable called "length in weeks" and also get some negative numbers. We really can't have negative time. Sometimes there are errors in the data collection itself that are outside our control; in my experience, they don't account for more than 10% of the data. Sometimes we don't see these issues until we begin exploring the data, but it is good to keep them in mind.
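The renaming and sanity checks above might look like this in pandas. The raw column names and values are invented, and the mapping from `X1`/`X2` to meaningful names is assumed to come from a data dictionary.

```python
import pandas as pd

# Hypothetical raw data with uninformative column names.
raw = pd.DataFrame({"X1": [1200, -300, 800], "X2": [4, 6, -2]})

# Rename columns to something meaningful (mapping assumed known).
df = raw.rename(columns={"X1": "budget", "X2": "length_in_weeks"})

# Flag rows with values that cannot be right, such as negative time.
invalid = (df["budget"] < 0) | (df["length_in_weeks"] < 0)

print(df[invalid])
print(invalid.mean())  # share of rows with at least one invalid value
```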

Within columns, there may be instances where we need to change values altogether or roll up some values into fewer categories. An example is using CASE WHEN to group several granular categories into a more practical handful of categories. While this is going on, we might also add a label to our data for supervised learning problems if none has been assigned. A label is a vector that marks each data point as class A, class B, etc. In our case, the labels could be testgroup and controlgroup. For salesforce, the label is win or loss.
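In pandas, a rough equivalent of the SQL CASE WHEN roll-up is a dictionary lookup with a fallback. The device categories below are hypothetical.

```python
import pandas as pd

# Hypothetical granular categories.
df = pd.DataFrame({
    "device": ["iphone", "android", "ipad", "windows", "mac", "smart_tv"],
})

# Mapping from granular values to broader groups (the WHEN ... THEN part).
rollup = {
    "iphone": "mobile",
    "android": "mobile",
    "ipad": "tablet",
    "windows": "desktop",
    "mac": "desktop",
}

# Values missing from the mapping fall into "other" (the ELSE part).
df["device_group"] = df["device"].map(rollup).fillna("other")

print(df)
```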

## Exploration 

Exploration is going to play an important role in driving decisions for the feature selection process. There are several elements that come into play when exploring a data set. We use exploration to find relationships among our variables, patterns, and distributions. We use all of this information gathered to make informed decisions in terms of our modeling process. I will talk about some of the most common elements used in exploration. 

The first is a statistical summary of your columns. The summary gives you a sense of the means, medians, maxes, mins, and modes of your variables. Both R and Python are able to differentiate between data types when producing summary results. Summaries are a great tool for keeping note of columns where irregularities might exist. The max and min of a variable can be useful for finding variables with unrealistic values. If you see a large difference between the mean and median, it is probably due to skewness, so you may want to keep some transformations in mind.
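As a small illustration, pandas produces this kind of summary with `describe()`. The data is made up, and comparing mean to median as a relative gap is just one informal way to spot skew.

```python
import pandas as pd

# Hypothetical data; one income value is far larger than the rest.
df = pd.DataFrame({
    "age": [22, 35, 41, 29, 38],
    "income": [40000, 52000, 48000, 45000, 400000],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe()

# Relative gap between mean and median; a large gap hints at skewness.
gap = (df.mean() - df.median()).abs() / df.median()

print(summary)
print(gap)
```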

Correlations are also examined in the exploration step. We want to see which variables are correlated with each other and, for a supervised learning problem, which are correlated with the outcome variable. Knowledge of statistics is required to understand how correlations can impact a model. Too many correlated variables produce multicollinearity, which means that one or more predictors can be closely approximated by a linear combination of other predictors. Fitting a regression on strongly collinear data tends to produce unstable coefficients that are hard to interpret.
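A minimal sketch of the correlation check, using invented data in which two columns measure the same thing in different units:

```python
import pandas as pd

# Hypothetical data: height_cm and height_in carry the same information.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 45, 27, 38],
    "purchased": [0, 0, 1, 1, 1],  # outcome variable
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of each predictor with the outcome.
with_outcome = corr["purchased"].drop("purchased")

print(corr.round(2))
print(with_outcome.round(2))
```

Here the two height columns are almost perfectly correlated, so keeping both adds no information and invites multicollinearity.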

During your exploration you will also want to identify and remove variables that may be biased and cause your algorithm to generalize from a subset and fail during prediction. This usually comes in the form of variables that are perfectly correlated with the outcome variable, which is common in supervised learning. What does this mean? Let's say our outcome variable is a binary 1 or 0. Our data is every potential customer who wants to buy a car online. If they bought one, their label is 1; otherwise it is 0. If I have a predictor variable such as "car model purchased", it only has a value when a purchase happened, so it would be perfectly correlated with our outcome variable. Any model built on it would "overfit" and not tell us anything meaningful. In this scenario, we want to find the most important features that influence a car purchase, such as credit history, income, etc.

The next thing to keep in mind during exploration is the distribution of categorical values. To be useful, a variable must take more than one distinct value; in other words, it cannot have the same value for every row. Even a variable with value A 99 times and value B once would still give us problems. This is when you need to consider zero or near-zero variance, which will play a bigger role in the feature engineering section.
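One simple way to spot the 99-to-1 situation is to look at the share of rows taken by the most common value. The helper name `dominant_share` is made up for this sketch.

```python
import pandas as pd

def dominant_share(series):
    """Return the fraction of rows taken by the most common value."""
    # value_counts sorts from most to least frequent by default.
    return series.value_counts(normalize=True).iloc[0]

# The example from the text: value A 99 times and value B once.
near_constant = pd.Series(["A"] * 99 + ["B"])

# A column with the same value in every row (zero variance).
constant = pd.Series(["A"] * 100)

print(dominant_share(near_constant))
print(dominant_share(constant))
```

A share close to 1 suggests the column carries very little information; where to draw the line is a judgement call.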

By the end of exploration, a user should know where to apply transforms based on distributions, flag correlated variables, and have a better understanding of relationships in the data. 

## Feature Engineering + Normalizations and Transforms

This step is going to depend on your observations from the exploration step. Feature engineering entails deriving features, applying transformations to features, removing features, and preparing your dataset for analysis. Let's break down what some of these things mean.

There are instances where we can use existing columns to derive new columns that could potentially be informative. Think back to what we did for salesforce. We used the pursuit creation date to derive a feature that would capture loyalty. We defined it as the number of previous quarters where we pitched business to the same advertiser within the same year. This sort of feature engineering is dependent on domain knowledge. 

Transformations are commonly done through normalization. This is useful when your data has multiple features with very different ranges. For example, let's say we have two numerical columns, age and income. Age has a range of 0-100 and income has a range of 0-100k. Without normalization, many models (distance-based or regularized ones, for example) can treat income as more important simply because of its larger range, and raw linear regression coefficients for the two features are not directly comparable. Normalization helps avoid the problem of one numerical variable being given extra importance because of its range. The transformation ensures that both variables are on the same scale.
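As an illustration, min-max scaling is one common form of normalization (standardizing to zero mean and unit variance is another). The sketch below uses invented age and income values and rescales both columns to the 0 to 1 range.

```python
import pandas as pd

# Hypothetical data with very different ranges.
df = pd.DataFrame({
    "age": [20, 40, 60, 80],
    "income": [20000, 50000, 80000, 100000],
})

# Min-max scaling: (x - min) / (max - min), applied column by column.
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```

After scaling, both columns run from 0 to 1, so neither dominates purely because of its units.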

Removing features should be done using information obtained in the exploration phase. One guideline is to remove features that are perfectly or almost perfectly correlated with the outcome variable, if there is one. This deals with potential overfitting. We should also know whether any of our variables have zero or near-zero variance. If a variable has zero variance, it is safe to say that it would not offer anything meaningful and could lead us to generalize an outcome based on a subset of data. Let's say we built a model using zero-variance variables. If we try to predict outcomes on new data that has the same variables but with higher variance, our model will not perform well. We can also use pairwise correlations to remove variables. Having too many variables correlated with each other causes multicollinearity, which by itself can destabilize the coefficient estimates of multiple regression and logistic regression. This is certainly something we want to avoid.
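A hedged sketch of removing features by pairwise correlation follows. The function `drop_correlated` and its 0.9 cutoff are assumptions for illustration, and which column of a correlated pair to keep is ultimately a judgement call.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, cutoff=0.9):
    """Drop one column from each pair whose absolute correlation exceeds cutoff.

    The cutoff is an illustrative assumption.
    """
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical predictors: b is exactly twice a, c is unrelated.
predictors = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 1, 4, 1, 5],
})

reduced, dropped = drop_correlated(predictors)
print(dropped)
print(list(reduced.columns))
```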

## Model + Validate

Once our data has been cleaned, we can proceed to modeling. Typically we take our dataset and split it into a training subset and a test subset. We build the model on the training subset and validate it against the test subset. 80/20 is a common split ratio. Before building models, we should be sure of our computational requirements. Some packages, such as Python's sklearn, require that categorical data be dummified. This means that each value in a categorical variable becomes its own column of 1's and 0's. If you need to dummify, it's best to check for zero variance in your new dummy variables.
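The dummify-then-split steps can be sketched with pandas alone, as below. The data is invented; in practice `train_test_split` from `sklearn.model_selection` is the more usual tool for the split.

```python
import pandas as pd

# Hypothetical dataset with one categorical predictor and a binary label.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue",
              "green", "red", "blue", "green", "red"],
    "price": [10, 12, 9, 11, 13, 8, 10, 12, 9, 11],
    "label": [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
})

# Dummify: each color value becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["color"])

# 80/20 split: sample 80% of rows for training, keep the rest for testing.
train = encoded.sample(frac=0.8, random_state=42)
test = encoded.drop(train.index)

print(list(encoded.columns))
print(len(train), len(test))
```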

Since we have been looking at this through the lens of supervised learning, we should know enough about which model to use and when to use it. You will find yourself grappling with the trade-off between interpretability and accuracy. We can check accuracy by seeing how well our model predicted the values in the test subset. However, depending on the model and outcome, "accuracy" can come in many forms.

For linear regression, we typically use R-squared, root mean squared error (RMSE), or mean absolute percentage error (MAPE).
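To make these three metrics concrete, here is a small NumPy sketch on made-up actual and predicted values:

```python
import numpy as np

# Hypothetical actual and predicted values from a test subset.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

residuals = y_true - y_pred

# Root mean squared error: typical size of an error, in the units of y.
rmse = np.sqrt(np.mean(residuals ** 2))

# Mean absolute percentage error (undefined if any y_true is zero).
mape = np.mean(np.abs(residuals / y_true)) * 100

# R-squared: share of the variance in y explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(rmse, 3), round(mape, 3), round(r_squared, 3))
```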

For classification, we use accuracy, F1 score, sensitivity, specificity, and confusion matrices.
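The classification metrics all come from the four cells of a binary confusion matrix. The sketch below computes them in plain Python on invented labels, treating 1 as the positive class.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
sensitivity = tp / (tp + fn)   # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(tp, fn, fp, tn)
print(accuracy, sensitivity, round(specificity, 3), round(f1, 3))
```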



Next week, I want to come up with an example of what the first three steps look like while moving on into the feature engineering and modeling. There is a lot to talk about here. 

## Homework 

Read chapter 1 of the Data Science for Business book if you have not done so already.  