# Welcome to the Course

This course consists of two lessons:

1. The Data Science Process
2. Communicating to Stakeholders

![CRISP-DM Data Science Process](crisp-dm.jpg "CRISP-DM")


## Stakeholders in Your Work as a Data Scientist

**Data Engineers**
Data scientists often depend greatly on the work of data engineers. Data science and data engineering teams need to collaborate closely, giving input on and understanding each other's needs and constraints, to ensure they can meet the goals of the business.

The data engineering team is responsible for gathering desired data, creating pipelines for ongoing ingestion of data, and structuring and storing the data in ways that prepare it for usage by the data science team, business analysts, and others. In general, data engineers are said to be responsible for ETL - the Extract, Transform, and Load processes needed for the data.

**Business Stakeholders**
Data scientists need to regularly interface with members of the business unit to understand what problems they are trying to solve, what questions and challenges they are encountering, and what data are important to them. These stakeholders may work in roles like product managers or might even be company executives.

**Communication**
The most important thing to keep in mind with all of these stakeholder groups is to communicate early and often!

**Further Optional Reading**
Here's an introductory piece on [why data scientists and data engineers need to work together](https://www.3agsystems.com/blog/ds-and-de).

And here is a useful blog post from a data scientist at Uber on [How to Work with Stakeholders as a Data Scientist](https://towardsdatascience.com/how-to-work-with-stakeholders-as-a-data-scientist-13a1769c8152).



## The CRISP-DM Process (Cross-Industry Standard Process for Data Mining)

This lesson focuses on helping you go through CRISP-DM in practice from start to finish. Even when we get into the weeds of coding, try to take a step back and recognize which part of the process you are in, make sure you remember the question you are trying to answer, and consider what a solution to that question would look like.


The first two steps of CRISP-DM are:

1. **Business Understanding** - This means understanding the problem and questions you are interested in tackling in the context of whatever domain you're working in. Examples include:

- How do we acquire new customers?
- Does a new treatment perform better than an existing treatment?
- How can we improve communication?
- How can we improve travel?
- How can we better retain information?

2. **Data Understanding** - At this step, you need to move the questions from Business Understanding to data. You might already have data that could be used to answer the questions, or you might have to collect data to get at your questions of interest.

### The CRISP-DM Process - Prepare Data
We have now defined the questions we want to answer and looked through the data available to find the answers - that is, we have looked at the first two steps here:

1. Business Understanding

2. Data Understanding

We can now look at the third step of the process:

3. Prepare Data

Luckily, Stack Overflow has already collected the data for us. However, we still need to wrangle the data into a form that lets us answer our questions. The wrangling and cleaning process is said to take 80% of the time of the data analysis process. You will see that this holds true in this lesson, as most of the remaining parts cover basic data wrangling strategies.

We will also weigh the advantages and disadvantages of each strategy covered in this lesson.


The first two business questions we explored were:

- How to break into the field?
- What are the placement and salaries for those who attended a coding bootcamp?

For these, we did not need to do any predictive modeling. We only used descriptive statistics and a little inferential statistics to retrieve the results.

Therefore, not all steps of CRISP-DM were necessary for these first two questions. CRISP-DM defines 6 steps:

1. Business Understanding

2. Data Understanding

3. Prepare Data

4. Data Modeling

5. Evaluate the Results

6. Deploy

For these first two questions, we did not need step 4. In the previous notebooks, you performed steps 3 and 5 without needing step 4 at all. A lot of the hype in data science, artificial intelligence, and deep learning is concentrated in step 4, but there are still plenty of questions that can be answered without using machine learning, artificial intelligence, or deep learning.

### All Data Science Problems Involve

1. Curiosity.
2. The right data.
3. A tool of some kind (Python, Tableau, Excel, R, etc.) used to find a solution (You could use your head, but that would be inefficient with the massive amounts of data being generated in the world today).
4. A well communicated or deployed solution.

### Extra Useful Tools to Know But That Are NOT Necessary for ALL Projects

- Deep Learning
- Fancy machine learning algorithms

You will be getting a more in-depth look at these items, but it is worth mentioning (given the massive amount of hype) that they do not solve every problem. Deep learning cannot turn bad data into good conclusions, or bad questions into amazing results.


When looking at the first two business questions we explored, we did not need to do any predictive modeling.

Therefore, not all steps of CRISP-DM were necessary for these first two questions. The process would look closer to the following:

1. Business Understanding

2. Data Understanding

3. Prepare Data

4. Evaluate the Results

5. Deploy

However, our approach will need to be different for the last two business questions we'll explore:

- How well can we predict an individual's salary? What aspects correlate well to salary?
- How well can we predict an individual's job satisfaction? What aspects correlate well to job satisfaction?

For these, we will need to use a predictive model. We will need to pick up at step 3 to answer these two questions, so let's get started. The process for answering these last two questions will follow the full 6 steps shown here.

1. Business Understanding

2. Data Understanding

3. Prepare Data

4. Model Data

5. Evaluate the Results

6. Deploy


There are two main 'pain' points for passing data to machine learning models in sklearn:

1. Missing values
2. Categorical values

Sklearn does not know how you want to treat missing values or categorical variables, and there are lots of methods for working with each. For this lesson, we will look at common, quick fixes. These methods help you get your models into production quickly, but treating missing values and categorical variables thoughtfully is what reduces bias and improves predictions over time.

Three strategies for working with missing values include:

1. We can remove (or “drop”) the rows or columns holding the missing values.
2. We can impute the missing values.
3. We can build models that work around them, and only use the information provided.


Though dropping rows and/or columns holding missing values is quite easy to do using NumPy and pandas, it is often not appropriate.
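As a minimal sketch of how little code dropping takes, the example below uses pandas `dropna` on a small made-up survey table. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks a skipped question.
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 72000.0, 61000.0],
    "years_coding": [2.0, 5.0, np.nan, 7.0],
    "country": ["US", "DE", "IN", "US"],
})

# Keep only the rows that have no missing values at all.
rows_dropped = df.dropna(axis=0)

# Keep only the columns that have no missing values at all.
cols_dropped = df.dropna(axis=1)

# Drop a row only if every value in it is missing.
all_missing_rows_dropped = df.dropna(how="all")

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
```

Note how aggressive the defaults are: one missing value is enough to lose a whole row or a whole column.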

Understanding why the data is missing is important before dropping these rows and columns. In this video, you saw a number of situations in which dropping values was not a good idea. These included:

1. Dropping data values associated with the effort or time an individual put into a survey.
2. Dropping data values associated with sensitive information.

In either of these cases, the missing values hold information. A quick removal of the rows or columns associated with these missing values would discard information that could be used to better inform models.

Instead of removing these values, we might keep track of the missing values using indicator values, or counts associated with how many questions an individual skipped.
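One way this tracking might look in pandas is sketched below. The table, the column names, and the `_missing` / `n_skipped` naming are assumptions made for the example, not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks a skipped question.
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 72000.0],
    "job_satisfaction": [7.0, np.nan, np.nan],
    "years_coding": [2.0, 5.0, 3.0],
})
question_cols = ["salary", "job_satisfaction", "years_coding"]

# Count how many questions each respondent skipped.
df["n_skipped"] = df[question_cols].isnull().sum(axis=1)

# Add a 1/0 indicator column for every question that has missing values.
for col in question_cols:
    if df[col].isnull().any():
        df[col + "_missing"] = df[col].isnull().astype(int)

print(df)
```

The indicator columns and the skip count can then be used as features, so the fact that a value was missing is kept even if the value itself is later imputed.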

In the last video, you saw cases in which dropping rows or columns associated with missing values would not be a good idea. There are other cases in which dropping rows or columns associated with missing values would be okay.

A few instances in which dropping a row might be okay are:

1. Dropping missing data associated with mechanical failures.
2. Dropping rows where the missing data is in the column that you are interested in predicting.
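For the second case, a minimal sketch using the `subset` argument of `dropna` is shown below. It assumes a hypothetical `salary` column as the value to predict; rows are dropped only when that column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data where salary is the column we want to predict.
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 72000.0, np.nan],
    "years_coding": [2.0, 5.0, np.nan, 7.0],
})

# Drop rows only where the response column is missing.
labeled = df.dropna(subset=["salary"], axis=0)

# Split into explanatory variables and response.
X = labeled.drop(columns=["salary"])
y = labeled["salary"]

print(len(labeled))
```

Missing values in the explanatory columns are left in place here, so they still need one of the other strategies.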

There are also cases, unrelated to missing values, in which you should consider dropping data:

1. Dropping columns with no variability in the data.
2. Dropping data associated with information that you know is not correct.
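A column with no variability can be found by counting its distinct values. The sketch below is one possible approach, on a made-up table with hypothetical column names.

```python
import pandas as pd

# Hypothetical data; survey_year is the same for every respondent.
df = pd.DataFrame({
    "survey_year": [2017, 2017, 2017],
    "country": ["US", "DE", "US"],
    "salary": [50000, 61000, 72000],
})

# nunique() counts distinct non-missing values in each column.
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]

# Drop the columns that carry no information.
reduced = df.drop(columns=constant_cols)

print(constant_cols)
```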

Before removing data, think about why it is missing or why it was entered incorrectly, and whether an alternative to dropping the values might work better.

One common strategy for working with missing data is to understand the proportion of a column that is missing. If a large proportion of a column is missing data, this is a reason to consider dropping it.

There are easy ways using pandas to create dummy variables to track the missing values, so you can see if these missing values actually hold information (regardless of the proportion that are missing) before choosing to remove a full column.
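The two ideas above can be sketched together: measure the proportion missing, then check whether the missingness itself relates to the outcome before dropping anything. The data, the column names, and the 50% threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; expected_salary was skipped by most respondents.
df = pd.DataFrame({
    "salary": [40000.0, 45000.0, 80000.0, 85000.0, 90000.0],
    "expected_salary": [np.nan, np.nan, 75000.0, np.nan, 88000.0],
    "years_coding": [1.0, 2.0, 8.0, 9.0, 10.0],
})

# Proportion of missing values in each column.
prop_missing = df.isnull().mean()

# Columns above an (arbitrary, hypothetical) 50% threshold.
mostly_missing = prop_missing[prop_missing > 0.5].index.tolist()

# Before dropping, track the missingness with a dummy variable and
# compare the outcome between respondents who answered and who skipped.
df["expected_salary_missing"] = df["expected_salary"].isnull().astype(int)
salary_by_missing = df.groupby("expected_salary_missing")["salary"].mean()

print(prop_missing)
print(salary_by_missing)
```

If the group means differ noticeably, the missing values may hold information, and keeping the indicator column could be better than dropping the column outright.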

### Imputation

Imputation is likely the most common method for working with missing values for any data science team. The methods shown here were the frequently used ones: imputing the mean, median, or mode of a column into that column's missing values.
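A minimal sketch of these three imputations with pandas `fillna` follows. The table and column names are hypothetical, and which statistic suits which column is a judgment call rather than a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing numeric and categorical values.
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 70000.0, 120000.0],
    "years_coding": [2.0, 4.0, np.nan, 6.0],
    "country": ["US", None, "US", "DE"],
})

imputed = df.copy()

# Median is less sensitive to extreme values, so use it for salary.
imputed["salary"] = df["salary"].fillna(df["salary"].median())

# Mean imputation for a roughly symmetric numeric column.
imputed["years_coding"] = df["years_coding"].fillna(df["years_coding"].mean())

# mode() returns a Series because ties are possible; take the first value.
imputed["country"] = df["country"].fillna(df["country"].mode()[0])

print(imputed)
```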

There are many advanced techniques for imputing missing values, including machine learning and Bayesian statistical approaches. These range from techniques as simple as using k-nearest neighbors to find the rows that are most similar and using their values to fill in what is missing, to complex methods like those in the very popular [Amelia package for R](https://cran.r-project.org/web/packages/Amelia/Amelia.pdf).

Regardless of your imputation approach, you should be very cautious of the bias you are incorporating into any model that uses these imputed values. Though imputing values is very common and often leads to better predictive power in machine learning models, it can lead to overgeneralizations. In extremely advanced techniques in Data Science, there can be [ethical implications](https://intelligence.org/files/EthicsofAI.pdf) of such bias. Machines can only 'learn' from the data they are provided. If you provide biased data (due to imputation, poor data collection, etc.), it should be no surprise that you will achieve results that are biased.


### Working with Categorical Variables
A common method for encoding categorical variables is with 1's and 0's. For example, pandas offers the `get_dummies` function to perform this type of conversion.

This approach has advantages and disadvantages, and will not be appropriate for every data context.
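As a rough illustration, the sketch below encodes a hypothetical `country` column with `pd.get_dummies`. The data is made up, and the `dummy_na` and `dtype` settings are just one reasonable choice.

```python
import pandas as pd

# Hypothetical data with one categorical column that has a missing value.
df = pd.DataFrame({
    "country": ["US", "DE", None, "US"],
    "years_coding": [2, 5, 3, 7],
})

encoded = pd.get_dummies(
    df,
    columns=["country"],  # only encode the categorical column
    dummy_na=True,        # add an extra column that flags missing values
    dtype=int,            # use 1/0 instead of True/False
)

print(encoded)
```

One disadvantage is visible even here: every level of the variable becomes its own column, so a variable with many levels can add a very large number of columns.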

### Dropping Columns of a Matrix When Using LinearRegression in Sklearn
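Sometimes we may want to drop a column of our X matrix to ensure it is full rank. This is not strictly required when using `LinearRegression` within sklearn. The model is ordinary least squares with no penalty, but the least-squares solver it relies on still returns a solution (the minimum-norm one) when X is not full rank. Dropping the columns is also okay, and it is required for OLS implementations that must invert the matrix X'X. Keep in mind that when X is not full rank the individual coefficients are not uniquely determined, even though the predictions are.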
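To make "full rank" concrete, the sketch below uses NumPy rather than sklearn on a tiny made-up design matrix. It has an intercept column plus one dummy column per category, so the columns are linearly dependent. `np.linalg.lstsq` is used here as a stand-in for a generic least-squares solver; the data is hypothetical.

```python
import numpy as np

# Intercept column plus a dummy column for each of two categories.
# intercept = dummy_a + dummy_b, so the columns are linearly dependent.
X_full = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])
y = np.array([10.0, 20.0, 12.0, 18.0])

# Rank is lower than the number of columns, so X_full is not full rank.
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one dummy column makes the matrix full rank.
X_reduced = X_full[:, :2]
rank_reduced = np.linalg.matrix_rank(X_reduced)

# lstsq returns the minimum-norm solution for the rank-deficient matrix.
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
coef_reduced, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)

# The coefficients differ, but the fitted values agree.
pred_full = X_full @ coef_full
pred_reduced = X_reduced @ coef_reduced

print(rank_full, rank_reduced)
print(np.allclose(pred_full, pred_reduced))
```

Both fits predict the mean of each category; only the way that prediction is split across coefficients changes.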

### Overfitting
Overfitting is a common problem in which a model fits its training data closely but does not generalize to data it has not seen before. Ensuring you build models that not only work for the data the model was trained on, but also generalize to new (test) data, is key to building models that can be deployed and succeed in production.
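One way to see this is to compare training and test error as a model gets more flexible. The sketch below is only an illustration under stated assumptions: the data is synthetic, polynomial degree stands in for model complexity, and the `fit_errors` helper is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear signal plus noise.
x = rng.uniform(-1, 1, size=40)
y = 2.0 * x + rng.normal(scale=0.5, size=40)

# Hold out data the model never sees during fitting.
x_train, y_train = x[:15], y[:15]
x_test, y_test = x[15:], y[15:]


def fit_errors(degree):
    """Fit a polynomial on the training data; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err


for degree in (1, 8):
    train_err, test_err = fit_errors(degree)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

The more flexible model always fits the training data at least as well. When its test error is clearly worse than its training error, the model is overfitting.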

### Evaluate & Deploy 
Two techniques for deploying your results include:

1. Automated techniques built into computer systems or across the web.
2. Communicating results with text, images, slides, dashboards, or other presentation methods to company stakeholders.

To get some practice with this second technique, you will be writing a blog post for the project and turning in a GitHub repository that shares your work.

As a data scientist, communicating your results to both other team members and less technical members of a company is a critical part of the role.



### Recap of the CRISP-DM Process
1. Business Understanding

These were the questions we decided to explore in our dataset:

- How do I break into the field?
- What are the placement and salaries of those who attended a coding bootcamp?
- How well can we predict an individual's salary? What aspects correlate well to salary?
- How well can we predict an individual's job satisfaction? What aspects correlate well to job satisfaction?

2. Data Understanding

Here we used the Stack Overflow data to attempt to answer our questions of interest. We did steps 1 and 2 in tandem in this case, using the data to help us arrive at our questions of interest. This is one of two methods that are common in practice. The other is to start with certain questions you are interested in answering, and then collect data related to those questions.

3. Prepare Data

This is commonly said to be 80% of the process. You saw this especially when attempting to build a model to predict salary, and there was still much more you could have done. We worked with missing data and found a way to work with categorical variables, but we didn't even look for outliers or attempt to find points we were especially poor at predicting. There was a ton more we could have done to wrangle the data, but you have to start somewhere, and then you can always iterate.

4. Model Data

We were finally able to model the data, but we had some back and forth with step 3 before we were able to build a model that had okay performance. There may still be changes that could improve the model we have in place. From additional feature engineering to choosing a more advanced modeling technique, we did little in this lesson to test whether other approaches would be better.

5. Results

Results are the findings from our wrangling and modeling. They are the answers you found to each of the questions.

6. Deploy

Deploying can occur by moving your approach into production or by using your findings to persuade others within a company to act on the results. Communication is a very important part of the role of a data scientist.


## README Showcase
Let's take a look at some of the qualities of good README files. In the last video, you saw that a good README should have:

1. Installations

2. Project Motivation

3. File Descriptions

4. How to Interact with your project

5. Licensing, Authors, Acknowledgements, etc.

