# 90 Minutes To Machine Learning

## Why are we here?
1. Intro to the Codeup experience
2. Big Picture overview of Data Science
3. Intro to Machine Learning covering:
    - Data visualization
    - Python ML libraries
    - Building a predictive model
    - Evaluating how well a predictive model performs

## Why Codeup?
- Focus on student outcomes
- Placement services and quality of network
- Immersion works. Full-time, live instruction for 5 months works.
- Projects simulate the work environment, from working with real-world data to presenting findings to stakeholders

## What is Data Science?
- Interdisciplinary applied science intersecting programming, statistics, and domain expertise
- The application of the scientific method (hypothesize -> experiment -> analyze -> repeat) to draw inferences from data.
- A broad description of approaches ranging from business analysis and visualizations to machine learning and deep neural network analysis.
![](drawn_ds_venn_diagram.png)

## What is Machine Learning?
- Machine learning uses past data to learn rules for predicting outcomes on future data.
- Classical programming takes business rules and data and produces answers. Example: TurboTax software.
- Machine learning takes in data (and sometimes answers, or labels, for some of that data) and produces rules or predictions for future data. Example: text message autocomplete.
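
The contrast above can be sketched in a few lines. This is only an illustration under stated assumptions: the order amounts, the labels, and the function name `is_large_order_rule` are hypothetical toy values, not part of the workshop material. The first half hard-codes a rule; the second half lets a one-split decision tree infer a similar rule from labeled examples.

```python
from sklearn.tree import DecisionTreeClassifier

# Classical programming: a person writes the rule by hand.
def is_large_order_rule(amount):
    return amount >= 100  # hand-picked threshold (hypothetical)

# Machine learning: the rule is inferred from labeled examples.
# Hypothetical toy data: order amounts and whether each was flagged (1) or not (0).
amounts = [[10], [25], [40], [60], [120], [150], [200], [300]]
flagged = [0, 0, 0, 0, 1, 1, 1, 1]

# max_depth=1 restricts the tree to a single split, i.e. one learned threshold
model = DecisionTreeClassifier(max_depth=1, random_state=123)
model.fit(amounts, flagged)

# The threshold the model chose lies somewhere between the last 0 and the first 1
learned_threshold = model.tree_.threshold[0]
print("learned threshold:", learned_threshold)

# Predictions for amounts the model has not seen before
predictions = model.predict([[30], [250]])
print("predictions:", list(predictions))
```

Nobody told the model where to draw the line; it picked a threshold that separates the labeled examples, which is the "data in, rules out" idea in miniature.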

<img src="classical_programming_vs_machine_learning.jpeg" width=500>


## Where does Machine Learning Fit Into Data Science?

![](example_data_science_project.png)

The Data Science Pipeline:
- Planning
- Data Acquisition
- Data Cleaning
- Exploratory Data Analysis (visualization, hypothesis testing)
- Modeling
- Presenting Findings

## Types of Machine Learning and Other Skills Covered in Codeup
![](machine_learning_methods.png)

## Challenges of Machine Learning
- Garbage in, garbage out
- Insufficient quantity of data
- Nonrepresentative data
- Poor quality data
- Overfitting or underfitting
- Bias in, bias out
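
Overfitting is easier to recognize with numbers in front of you. The sketch below is an assumption-laden illustration, not a benchmark: it uses synthetic data from `make_classification` with 20% of the labels flipped at random, then compares an unrestricted decision tree with one limited to `max_depth=3`. An unrestricted tree can memorize noisy training data, so its training accuracy usually looks far better than its accuracy on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 20% of labels are flipped at random
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, flip_y=0.2, random_state=123
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=.3, random_state=123
)

# No depth limit: the tree keeps splitting until it fits the training data
deep_tree = DecisionTreeClassifier(random_state=123).fit(X_train, y_train)

# Depth limit of 3: the tree can only learn a handful of broad rules
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_train, y_train)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    train_accuracy = tree.score(X_train, y_train)
    validate_accuracy = tree.score(X_validate, y_validate)
    print(f"{name}: train={train_accuracy:.2f}, validate={validate_accuracy:.2f}")
```

A large gap between training and validation accuracy is the usual warning sign of overfitting; low accuracy on both suggests underfitting.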


In [None]:
# Data Processing and Data Cleaning Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn data splitting
from sklearn.model_selection import train_test_split

# modeling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# model evaluation
from sklearn.metrics import classification_report

In [2]:
def split_data(df, stratify_by=None):
    """
    3-way split into train, validate, and test datasets.
    To stratify, send in a column name.
    """

    if stratify_by is None:
        # First hold out 20% of the rows as the test set
        train, test = train_test_split(df, test_size=.2, random_state=123)
        # Then split the remaining 80% (not the full df) into train and validate
        train, validate = train_test_split(train, test_size=.3, random_state=123)
    else:
        # Same two steps, keeping the class proportions of the stratify column
        train, test = train_test_split(df, test_size=.2, random_state=123, stratify=df[stratify_by])
        train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train[stratify_by])

    return train, validate, test
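
A quick way to sanity-check a three-way split is to run the two steps on a small DataFrame and confirm that the pieces are the expected size and do not overlap. The sketch below assumes a hypothetical 100-row DataFrame with a balanced `label` column; the column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a balanced binary label
df = pd.DataFrame({"feature": range(100), "label": [0, 1] * 50})

# Step 1: hold out 20% of all rows as the test set
train_validate, test = train_test_split(
    df, test_size=.2, random_state=123, stratify=df["label"]
)

# Step 2: split the remaining 80% into train (70% of it) and validate (30% of it)
train, validate = train_test_split(
    train_validate, test_size=.3, random_state=123, stratify=train_validate["label"]
)

print(len(train), len(validate), len(test))

# No row should appear in more than one split
all_indexes = list(train.index) + list(validate.index) + list(test.index)
print("rows are unique across splits:", len(set(all_indexes)) == len(df))
```

The second split has to start from the rows left over after the first one. Splitting the full DataFrame twice would let test rows leak into the training or validation data.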

## Homework