# Final Report
## Classification Analysis
### Brandyn Waterman, 3/14/2022, Innis Cohort

Good afternoon! We begin with imports that are needed for operating this notebook:

In [4]:
# Calculations and df manipulation
import pandas as pd
import numpy as np

# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# Math & Statistics
from scipy import stats
import statistics
import math

# SQL access
from env import host, user, password

# Imported modules acquire.py and prepare.py
from acquire import get_telco_data
from prepare import prep_telco

# sklearn suite for modeling and analysis
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.tree import export_graphviz, export_text
from sklearn.metrics import classification_report, confusion_matrix, recall_score, precision_score
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

## Overview:
The purpose of this project is to help reduce customer churn at Telco. This report walks through the steps taken to reach our **goals** of:
- Identifying key drivers (acquire, prepare, explore)
- Creating classification models for churn prediction (model)
- Providing recommendations and solutions based on the information learned (summary)


### Planning:
The first step in this journey is to establish an initial plan and question set to guide our interactions with the data. 

Some of the initial questions relating to the data are:
- What is our current baseline of churn?
- How does plan type (service) impact churn?
- How does internet type impact churn?
- Do any demographic attributes impact churn?
- Does cost impact churn?
- Does tenure impact churn?

For business purposes:
- How much is this churn costing Telco?


### Acquire:
Under the hood, our acquire module uses get_db_url() to access the SQL server with our credentials, and then get_telco_data() to query the SQL database for our data.

The query selects from the customers, internet_service_types, contract_types, and payment_types tables in the 'telco-churn' database. Before querying, the function checks whether a local .csv file already exists; if it does, the cached file is read instead. Otherwise, the SQL response is converted to a dataframe and that dataframe is written to a local .csv file. In either case the dataframe is returned.
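The caching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of acquire.py: the helper name `load_with_cache`, the `query_fn` callable, and the default file name are all assumptions made for the example.

```python
import os
import pandas as pd

def load_with_cache(query_fn, cache_path="telco.csv"):
    """Return a dataframe, preferring a local csv over a fresh query.

    query_fn is a hypothetical zero-argument callable that runs the SQL
    query and returns a dataframe (for example, a wrapper around pd.read_sql).
    """
    if os.path.isfile(cache_path):
        # A cached copy exists, so skip the database entirely
        print("Using cached csv")
        return pd.read_csv(cache_path)

    # No cache yet: run the query and save the result for next time
    df = query_fn()
    df.to_csv(cache_path, index=False)
    return df
```

Passing the query in as a callable keeps the sketch independent of the database credentials; the real module presumably builds its connection with get_db_url() instead.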


In [3]:
telco_data = get_telco_data()

Using cached csv


### Prepare:
Under the hood, our prepare module uses telco_split() to take in a dataframe and return three dataframes: train, validate, and test. These are a 56%, 24%, and 20% split of the prepared dataframe, respectively.
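One way to arrive at a 56/24/20 split is to call train_test_split twice: first hold out 20% for test, then take 30% of the remaining 80% for validate (0.8 × 0.3 = 0.24, leaving 0.8 × 0.7 = 0.56 for train). The sketch below assumes that approach; the function name, the `churn` target column, the stratification, and the seed are illustrative choices and may differ from what telco_split() actually does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_56_24_20(df, target="churn", seed=123):
    """Split df into train (56%), validate (24%), and test (20%).

    Stratifying on the target is an assumption made here so that each
    subset keeps roughly the same churn rate.
    """
    # Step 1: hold out 20% of all rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Step 2: 30% of the remaining 80% becomes validate (24% overall)
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test
```

On a 1,000-row dataframe this yields 560, 240, and 200 rows, and no row appears in more than one subset.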

The prep_telco() function takes in our acquired dataframe and cleans it for use. 

The order of steps is as follows:

To ensure we have no duplicates in our data: df = df.drop_duplicates()

To remove some redundant columns (internet_service_type_id, contract_type_id, payment_type_id): df = df.drop(columns=['internet_service_type_id', 'contract_type_id', 'payment_type_id'])

To fix the total_charges column: df.total_charges = df.total_charges.replace(' ', np.nan).astype(float)
- The issue was that total_charges had the wrong datatype (object rather than float) and held blank strings (' ') where values were missing, instead of NaN

To address these **missing values**: df.dropna(inplace=True)
- In total there were 11 rows of missing total_charges values. These were due to the tenure of these customers being 0. Since this is a very small portion of the total dataset they were dropped. They did not have enough tenure to be considered relevant and they were causing missing values in the data. 
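To make the total_charges fix concrete, here is a small self-contained demonstration on made-up rows (the values are invented for illustration and are not from the Telco data). It shows a blank string becoming NaN, the column converting to float, and the incomplete row being dropped.

```python
import numpy as np
import pandas as pd

# Toy data: the tenure-0 customer has a blank total_charges value
demo = pd.DataFrame({
    "tenure": [0, 12, 1],
    "total_charges": [" ", "840.5", "29.85"],
})

# Blank strings become NaN, then the column can be cast to float
demo["total_charges"] = demo["total_charges"].replace(" ", np.nan).astype(float)

# Drop the rows that now contain missing values
demo = demo.dropna()

print(demo)
```

After these steps only the two customers with non-zero tenure remain, and total_charges is a float column.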

The categorical columns were then identified for encoding by checking whether each column's dtype is 'O' (object): cat_cols = [col for col in df.columns if df[col].dtype == 'O']
- We want to ensure that our customer_id column does not get encoded so we remove it before the next step: cat_cols.remove('customer_id')

To iterate through our categorical columns and encode them: 
```
for col in cat_cols:
    dummy_df = pd.get_dummies(df[col],
                              prefix=df[col].name,
                              drop_first=True,
                              dummy_na=False)
    df = pd.concat([df, dummy_df], axis=1)
    df = df.drop(columns=col)
```
- This will create dummy columns, concat them to our dataframe, and drop the now redundant column.

After our data is encoded the telco_split() function is utilized and our train, validate, and test dataframes are returned. 

In [6]:
train, validate, test = prep_telco(telco_data)

### Explore:
We now want to answer as many of our initial questions as possible, and expand on any insights that emerge from the data.