# 90 Minutes To Machine Learning

## Why are we here?
1. Intro to the Codeup experience
2. Big Picture overview of Data Science
3. Intro to Machine Learning concepts and tools including:
    - Data acquisition and preparation
    - Data visualization
    - Building a predictive model w/ Scikit-Learn
    - Evaluating how well a predictive model performs

## Why Codeup?
- Focus on student outcomes
- Placement services and quality of network
- Immersion works. Full-time, live instruction for 5 months works.
- Projects simulate the work environment, from working with real-world data to presenting findings to stakeholders

## What is Data Science?
- Interdisciplinary applied science intersecting programming, statistics, and domain expertise
- The application of the scientific method (hypothesis -> experiment -> analyze -> repeat) to analyze data and infer outcomes from it.
- A broad description of approaches ranging from business analysis and visualizations to machine learning and deep neural network analysis.
![](drawn_ds_venn_diagram.png)

## How Does Data Science Relate to Traditional Software and Data Analysis?
![](data_science_venn_diagram_with_overlapping_disciplines.png)

## What is Machine Learning?
- Machine learning is the process of using past data to determine rules for predicting outcomes from future data.
- Classical programming takes business rules and data to produce answers. Example: TurboTax software.
- Machine learning takes in data (and sometimes answers/labels for some data) and produces rules or predictions for future data. Example: text message autocomplete.
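
The contrast can be sketched in a few lines. This is a toy illustration, not part of the workshop data: the age rule, the `is_adult_by_rule` function, and the sample ages are all made up for the example. The hand-written function stands in for classical programming, and the fitted tree stands in for a rule learned from labeled data.

```python
from sklearn.tree import DecisionTreeClassifier

# Classical programming: a person writes the rule by hand.
def is_adult_by_rule(age):
    return "adult" if age >= 18 else "minor"

# Machine learning: we supply data plus answers (labels) and let the
# algorithm infer a rule. The ages below are invented for illustration.
ages = [[5], [12], [16], [17], [18], [21], [35], [60]]
labels = [is_adult_by_rule(row[0]) for row in ages]

model = DecisionTreeClassifier(random_state=123)
model.fit(ages, labels)

# Apply the learned rule to ages the model has never seen.
learned = model.predict([[10], [40]])
print(learned)
```

The model never sees the number 18. It only sees examples and their labels, and it places its own cutoff somewhere between the largest "minor" age and the smallest "adult" age in the training data.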

<img src="classical_programming_vs_machine_learning.jpeg" width=500>

## Where does Machine Learning Fit Into Data Science?

![example data science pipeline and product](example_data_science_project.png)


## Challenges of Machine Learning
- Garbage in, garbage out
- Insufficient quantity of data
- Nonrepresentative data
- Poor quality data
- Overfitting or underfitting
- Bias in, Bias out:
    - [Cognitive Biases](https://en.wikipedia.org/wiki/List_of_cognitive_biases) arise from being human.
    - [Statistical Biases](https://en.wikipedia.org/wiki/Bias_(statistics)) arise from our methodologies.
- Whatever Machine Learning "learns", it will keep doing. There is no cognition or intelligence, only pattern recognition and optimization.
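
Overfitting and underfitting are easier to see with numbers. The sketch below uses synthetic data from scikit-learn's `make_classification` with deliberately noisy labels (an assumption for illustration, not the workshop dataset) and compares decision trees of different depths on training data versus held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns about 20% of labels to add noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for each depth.
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

A tree with no depth limit memorizes the training rows, noise included, so its training accuracy is perfect while its held-out accuracy is lower. That gap is the signature of overfitting. A very shallow tree can miss real structure instead, which is underfitting.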

## What kinds of questions can Data Science methods answer?
- How Many or How Much of something (Regression)
- Is this observation A or B, or C or D or E... (Classification)
- What groupings exist in the data already (Clustering)
- What should we expect to happen next? (Time Series Analysis)
- Is this weird? (Anomaly Detection)
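
As a rough sketch, the first three question types map onto scikit-learn estimators as shown below. The six data points are invented for illustration, and the choice of `LinearRegression`, `DecisionTreeClassifier`, and `KMeans` is just one option per question type. Time series analysis and anomaly detection are left out to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# One made-up feature with two obvious groups of values.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: "how much?" The answer is a number.
y_amount = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
how_much = LinearRegression().fit(X, y_amount).predict([[5.0]])

# Classification: "is this A or B?" The answer is a label.
y_label = np.array(["low", "low", "low", "high", "high", "high"])
which_class = DecisionTreeClassifier(random_state=123).fit(X, y_label).predict([[5.0]])

# Clustering: "what groupings already exist?" No labels are supplied.
groups = KMeans(n_clusters=2, n_init=10, random_state=123).fit_predict(X)

print(how_much, which_class, groups)
```

Regression and classification both need answers (labels) to learn from. Clustering works from the features alone, so its group numbers are arbitrary identifiers rather than meaningful names.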

## Types of Machine Learning and Other Skills Covered in Codeup
![machine learning methods taught at Codeup](machine_learning_methods.png)

## What kind of ML will we be doing today?
- We'll be using a decision tree classifier to predict whether or not we should expect employees to quit a company.
- Classification machine learning is used all the time for such things as:
    - Facial recognition
    - Handwriting recognition and conversion to typed text
    - Recommendation engines
- Classification is a "supervised learning" type of machine learning. That means we train the algorithm on existing data to learn a rule, a recognized pattern, to apply to future data.

## How does a decision tree work?
- Decision trees work like a game of 20 questions: each step asks a yes/no question about one feature.
- Classification algorithms learn from training data in different ways. Some measure the distance between points or the distance to a boundary between groups of points; a decision tree instead searches for the feature thresholds that best separate the classes.
- By "learning" these patterns, the classifier produces a "decision rule" that it applies to classify new incoming data.

#### Consider this diagram of a decision tree
![decision tree diagram](decision_tree_diagram.png)
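
To see the "20 questions" a tree actually learns, scikit-learn's `export_text` prints the fitted rules as text. The sketch below uses a tiny made-up table (the eight rows are invented for illustration and only borrow column names from the workshop data), so the printed rule says nothing about real employees.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy rows for illustration only.
toy = pd.DataFrame({
    "Age": [22, 25, 30, 41, 45, 50, 28, 35],
    "MonthlyIncome": [2000, 2500, 6000, 7000, 3000, 8000, 2200, 6500],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
})
X = toy[["Age", "MonthlyIncome"]]
y = toy["Attrition"]

tree = DecisionTreeClassifier(max_depth=2, random_state=123).fit(X, y)

# Each line of the output is one question or one final answer (leaf).
rules = export_text(tree, feature_names=["Age", "MonthlyIncome"])
print(rules)

# Classify a new, unseen row by walking the same questions.
new_employee = pd.DataFrame({"Age": [33], "MonthlyIncome": [2400]})
prediction = tree.predict(new_employee)
print(prediction)
```

In this toy table every low-income row is labeled "Yes" and every high-income row "No", so the tree only needs a single question about `MonthlyIncome` to separate them.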

In [1]:
# Data Processing and Data Cleaning Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn preprocessing
from sklearn.model_selection import train_test_split

# modeling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# model evaluation
from sklearn.metrics import classification_report

In [2]:
def split(df, stratify_by=None):
    """
    3 way split for train, validate, and test datasets
    To stratify, send in a column name
    """

    if stratify_by is None:
        # Hold out 20% for test, then carve validate out of the remaining train rows
        train, test = train_test_split(df, test_size=.2, random_state=123)
        train, validate = train_test_split(train, test_size=.3, random_state=123)
    else:
        train, test = train_test_split(df, test_size=.2, random_state=123, stratify=df[stratify_by])
        train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train[stratify_by])

    return train, validate, test
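
A quick way to sanity-check a three-way split is to run it on a small frame and confirm that the pieces have the expected sizes and do not overlap. The sketch below is self-contained: `three_way_split` is a hypothetical stand-in with the same 20% and 30% proportions, and the `demo` frame is made up for the check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df, stratify_by=None, seed=123):
    """Hypothetical helper: 20% test, then 30% of the remainder as validate."""
    strat = df[stratify_by] if stratify_by else None
    train, test = train_test_split(df, test_size=.2, random_state=seed, stratify=strat)
    # The second split must draw from train, not df, so test rows stay held out.
    strat = train[stratify_by] if stratify_by else None
    train, validate = train_test_split(train, test_size=.3, random_state=seed, stratify=strat)
    return train, validate, test

# Made-up frame: 100 rows, 20% of them labeled "Yes".
demo = pd.DataFrame({
    "Age": range(20, 120),
    "Attrition": ["Yes"] * 20 + ["No"] * 80,
})

train, validate, test = three_way_split(demo, stratify_by="Attrition")
print(len(train), len(validate), len(test))

# No row should appear in more than one of the three pieces.
all_indexes = list(train.index) + list(validate.index) + list(test.index)
print(len(all_indexes) == len(set(all_indexes)))
```

On 100 rows this gives 56 train, 24 validate, and 20 test rows. Stratifying on the target keeps the share of "Yes" rows roughly the same in each piece, which matters when one class is much rarer than the other.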

In [3]:
df = pd.read_csv("data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Attrition,Age,MonthlyIncome,Gender,Education,WorkLifeBalance,JobSatisfaction,PercentSalaryHike,BusinessTravel
0,0,Yes,41,5993,Female,2,1,4,11,Travel_Rarely
1,1,No,49,5130,Male,1,3,2,23,Travel_Frequently
2,2,Yes,37,2090,Male,2,3,3,15,Travel_Rarely
3,3,No,33,2909,Female,4,3,3,11,Travel_Frequently
4,4,No,27,3468,Male,1,3,2,12,Travel_Rarely


In [4]:
sns.boxplot(x=df.Attrition, y=df.MonthlyIncome, data=df)

In [None]:
sns.boxplot(x=df.Attrition, y=df.Age, data=df)

In [None]:
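# Assumes the dataset has a DistanceFromHome column; it is not among the columns in the data.csv preview
sns.boxplot(x=df.Attrition, y=df.DistanceFromHome, data=df)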

In [None]:
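# Assumes the dataset has a NumCompaniesWorked column; it is not among the columns in the data.csv preview
sns.boxplot(x=df.Attrition, y=df.NumCompaniesWorked, data=df)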
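
The remaining two steps from the agenda, building a model and evaluating it, might look roughly like the sketch below. It is a minimal illustration under stated assumptions: the `demo` frame is randomly generated, the rule that assigns `Attrition` is invented, and the feature list and `max_depth=3` are arbitrary choices. With the real data, the same calls would run on the train and validate pieces of `df`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Randomly generated stand-in data; column names borrowed from the preview.
rng = np.random.default_rng(123)
n = 300
demo = pd.DataFrame({
    "Age": rng.integers(18, 60, n),
    "MonthlyIncome": rng.integers(1000, 20000, n),
    "JobSatisfaction": rng.integers(1, 5, n),
})
# Invented labeling rule, used only so the tree has a pattern to find.
demo["Attrition"] = np.where(
    (demo.MonthlyIncome < 4000) & (demo.JobSatisfaction <= 2), "Yes", "No"
)

X = demo[["Age", "MonthlyIncome", "JobSatisfaction"]]
y = demo["Attrition"]
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Fit on the training rows only.
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X_train, y_train)

# Evaluate on rows the model did not see during training.
accuracy = tree.score(X_validate, y_validate)
report = classification_report(y_validate, tree.predict(X_validate))
print(report)
```

The report lists precision, recall, and F1 for each class. When one class is rare, as attrition usually is, those per-class numbers are more informative than overall accuracy, because a model that always predicts "No" can still score a high accuracy.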

## Homework