<a href="https://colab.research.google.com/github/ragavkumar/Predictive_Modeling/blob/main/%5BPublic%5D_Intro_To_Data_Science_Predictive_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Data Science: Predictive Modeling
## Bank Marketing Data

Today, we will be learning the basics of Predictive Modeling by exploring bank marketing data and using it to answer a key question.

__Question:__ Is a customer a good candidate for a new product we're offering?

__Data Set:__ Information from direct marketing campaigns of a Portuguese banking institution, including:
- General customer background information
- Customer banking information
- Marketing campaign contact information

__Goal:__ Use the data from the past to build a predictive model to help us predict who will be good candidates for our marketing initiatives.


You will need to add some code to complete this notebook. Follow along with the instructor to find out what code to add, then add it wherever the code says "\*\*\*ADD CODE HERE\*\*\*".

Have fun and good luck coding!

> To execute a line or block of code, simply click the "Play" button on the left side or use the keyboard shortcut "Shift + Enter"
> When that code block has actually been executed, the blank brackets will change to have a number inside of them.

## Seed and Target Data
## Importing the packages that we'll need

One of the things that makes Python **great** for data science is the wide range of libraries that already exist, so we don't have to code everything from scratch. Today we'll be taking advantage of:
- [Numpy](https://numpy.org/) for scientific and mathematical computing
- [Pandas](https://pandas.pydata.org/) for data wrangling and analysis
- [Sklearn](https://scikit-learn.org/stable/) for all things machine learning

In [None]:
# Import the appropriate packages
***ADD CODE HERE***

# machine learning
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Packages for rendering our tree.
from IPython.display import Image
import pydotplus
import graphviz
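The later cells refer to Pandas through the name `pd`, so the placeholder in the cell above presumably stands for the conventional aliased imports. A minimal sketch, assuming the usual `np` and `pd` aliases (the notebook itself does not show them):

```python
# Conventional aliases (an assumption; later cells only show that `pd` is expected)
import numpy as np
import pandas as pd

print(np.__version__, pd.__version__)
```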

## Import the data
Pandas can work with information from all kinds of data sources. Below, we'll import the data we need from a GitHub URL and read it into a Pandas Dataframe using the Pandas [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [None]:
# import data from github
data = pd.read_csv("https://raw.githubusercontent.com/ephs08kmp/predictive_modeling_workshop/master/bank.csv",
                   sep=';')
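To see why `sep=';'` matters, here is a small sketch that reads a made-up, two-row sample in the same semicolon-separated layout from an in-memory string. The sample rows and the name `toy_sample` are illustrative assumptions, not rows copied from the real file.

```python
import io
import pandas as pd

# Hypothetical sample in a semicolon-separated layout (not real rows from bank.csv)
raw = 'age;job;balance;y\n30;"unemployed";1787;"no"\n33;"services";4789;"yes"\n'

# With sep=';' each field becomes its own column;
# with the default comma separator the whole row would land in one column
toy_sample = pd.read_csv(io.StringIO(raw), sep=';')
print(toy_sample.shape)
print(toy_sample.columns.tolist())
```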

## Understand the data
Exploratory data analysis (EDA) begins with a solid understanding of the data and where you are starting.  This includes starting with the basics of what is in your dataset.

In [None]:
# Check out the first lines of the data set
data.head()

Quick Notes about some of the variables
- Customer Attributes
  - **default**: has credit in default? (binary: "yes","no")
  - **balance**: average yearly balance, in euros (numeric) 
  - **housing**: has housing loan? (binary: "yes","no")
  - **loan**: has personal loan? (binary: "yes","no")
- Related with the last contact of the current campaign:
  - **contact**: contact communication type (categorical: "unknown","telephone","cellular") 
  - **day**: last contact day of the month (numeric)
  - **month**: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  - **duration**: last contact duration, in seconds (numeric)
- Other attributes:
  - **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  - **pdays**: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  - **previous**: number of contacts performed before this campaign and for this client (numeric)
  - **poutcome**: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
- Output variable (desired target AKA "the thing we're ultimately trying to predict"):
  - **y** - has the client subscribed to our new product? (binary: "yes","no")

In [None]:
# Checking the size of our data (rows, columns)
data.shape

In [None]:
# Get a concise summary of the dataset
data.info()

In [None]:
# Understand the basic statistical details of the data set
data.describe()
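One more basic check that can be useful at this stage is how balanced the target column is. The sketch below uses a tiny made-up DataFrame (`toy_data` is a hypothetical name) so it does not touch the workshop's `data` variable; the same `value_counts` call should work on `data['y']`.

```python
import pandas as pd

# Tiny made-up dataset standing in for the real one
toy_data = pd.DataFrame({'age': [30, 45, 52, 28],
                         'y': ['no', 'no', 'yes', 'no']})

print(toy_data.shape)  # (rows, columns)

# Share of each class in the target column
toy_balance = toy_data['y'].value_counts(normalize=True)
print(toy_balance)
```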

## Clean the Data

Now that we understand the basics of what's in the data, we need to clean the data before it's ready for modeling.  We won't spend as much time on this today as you would during a full-scale project, but some things you could clean from the data as an extra **challenge**:
- Remove duplicate data
- Fill in missing data (there isn't any missing data in this dataset)
- Identify and clean outliers
- Manipulate data

> This data is not ready for machine learning because there are both categorical and numerical values, and our model can only interpret numerical values.  We need to turn strings and chars into integers.

### Process the Data
First, several variables are currently a binary "yes" or "no", and our model can't interpret those values. Let's use the `.map` method to map "no" to 0 and "yes" to 1.

In [None]:
# Map "no" to 0 and "yes" to 1
yn_map = {'no': 0,
          'yes': 1}
data['default'] = data['default'].map(yn_map)
data['housing'] = data['housing'].map(yn_map)
data['loan'] = data['loan'].map(yn_map)
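One thing to watch for with `.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. That includes values that were already mapped, so running the cell above a second time would turn the 0s and 1s into missing values. A small sketch with made-up answers (the names `toy_answers` and `toy_mapped` are hypothetical):

```python
import pandas as pd

yn_map = {'no': 0, 'yes': 1}

# Made-up values; 'Yes' is deliberately mis-cased to show what happens to unmapped entries
toy_answers = pd.Series(['yes', 'no', 'Yes', 'no'])
toy_mapped = toy_answers.map(yn_map)

print(toy_mapped)
print('Unmapped values:', toy_mapped.isna().sum())
```

Checking `isna().sum()` after mapping is a quick way to confirm that every value was covered by the dictionary.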

In [None]:
# Check to make sure the change worked 
# by looking at the first 5 rows of data
data.head()

Because we have a number of categorical variables, we need to encode them as numbers.  We can use the pandas [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function to help with this encoding. We will leave job and month out of our model for simplicity.  As a challenge, try adding them back in and see how they impact your model.

In [None]:
# One hot encoding using pandas
X = pd.get_dummies(***ADD CODE HERE***)
y = data['y']

# print out this new formatted data to see what happened
X.head()
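To make the encoding concrete, here is a sketch of what `get_dummies` does to a small made-up DataFrame. The name `toy_features` and its rows are illustrative assumptions; the sketch does not fill in the placeholder above.

```python
import pandas as pd

# Made-up rows with one numeric and one categorical column
toy_features = pd.DataFrame({'age': [30, 45, 52],
                             'contact': ['cellular', 'unknown', 'telephone']})

# Numeric columns pass through unchanged; each category becomes its own indicator column
toy_encoded = pd.get_dummies(toy_features)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Each new column is named `<original column>_<category>`, which is where names like `contact_unknown` come from.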

The columns that were categorical are now numerical. Feel free to go back and check that these numbers make sense against our original data.

Finally, we'll set 20% of the data points from our dataset aside for assessing the model using the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

![](https://docs.splunk.com/images/thumb/3/3b/TrainTest.png/550px-TrainTest.png)

In [None]:
# Creating training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5280)
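The sketch below shows the 80/20 split on a made-up array, using `toy_`-prefixed names (hypothetical) so the workshop variables are left alone. It also shows the optional `stratify` parameter, which is not used in this notebook but keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 made-up rows with 2 features; 40 of class 0 and 10 of class 1
toy_X = np.arange(100).reshape(50, 2)
toy_y = np.array([0] * 40 + [1] * 10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=5280)
print(toy_X_train.shape, toy_X_test.shape)

# Optional: stratify on the target so the test set keeps the 80/20 class mix
_, _, _, toy_y_test_strat = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=5280, stratify=toy_y)
print('Positives in stratified test set:', toy_y_test_strat.sum())
```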

**Now our data is ready for modeling!**

### Predictive Modeling
#### Decision Trees
In classification problems, a decision tree works like a flow chart: the model uses past data to find the split points that best separate the categories, then follows those splits to predict which category a new data point belongs to.

[Decision trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) are often a reasonable choice for unbalanced datasets, but they can be prone to overfitting (the model fits the training set so closely that it predicts poorly on new data).
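Overfitting is easier to recognize with numbers. The sketch below uses synthetic data from `make_classification` (an assumption made for illustration, not the bank data) and compares a tree with no depth limit against a shallow one. An unrestricted tree typically scores perfectly on its own training set while doing noticeably worse on the held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomly reassigns some labels)
toy_X, toy_y = make_classification(n_samples=400, n_features=10,
                                   flip_y=0.2, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=42).fit(toy_X_train, toy_y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(toy_X_train, toy_y_train)

for name, model in [('no depth limit', deep_tree), ('max_depth=3', shallow_tree)]:
    print(name,
          '| train:', round(model.score(toy_X_train, toy_y_train), 3),
          '| test:', round(model.score(toy_X_test, toy_y_test), 3))
```

A large gap between the training and test scores is the usual warning sign; parameters such as `max_depth` and `min_samples_leaf` are there to narrow it.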

##### Decision Tree Example
<div>
<img src="https://eloquentarduino.github.io/wp-content/uploads/2020/08/DecisionTree.png" width="500"/>
</div>

First we will create an empty decision tree, then we will fit the decision tree with all of our marketing data as the inputs, and whether or not the customer subscribed to the new product as the output.  Finally, we will render our tree so we can take a look at our decision tree.

In [None]:
# Instantiate your model (start up your model) with an empty decision tree
clf = tree.DecisionTreeClassifier(max_depth=5,          # Maximum depth (levels of splits) of the tree
                                  max_features=10,      # Number of features (columns) considered at each split
                                  min_samples_leaf=20,  # Minimum number of samples in a leaf (end point)
                                  random_state=42)      # Controls randomness for consistency

# Fit our decision tree with inputs, target
clf_train = clf.fit(***ADD CODE HERE***)

# Render our tree.
dot_data = tree.export_graphviz(
    clf_train, out_file=None,
    feature_names=list(X_train.columns),  # a plain list works across scikit-learn versions
    class_names=['Not Subscribed', 'Subscribed'],
    filled=True
)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
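If `pydotplus` or Graphviz is not available, scikit-learn can also print a fitted tree as plain text with `tree.export_text`. The sketch below fits a small tree on synthetic data, using hypothetical `toy_` names and made-up feature names, to show the fit-then-inspect pattern without depending on the workshop variables.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data: 200 rows, 4 features
toy_X, toy_y = make_classification(n_samples=200, n_features=4, random_state=0)

# Fit takes the inputs first, then the target
toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=42).fit(toy_X, toy_y)

# Text rendering of the split rules; feature names here are made up
toy_rules = tree.export_text(toy_clf, feature_names=['f0', 'f1', 'f2', 'f3'])
print(toy_rules)
```

Each indented line is one split, and the `class:` lines are the leaves where a prediction is made.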

> **You be the Data Scientist:** What features look important in making the prediction of whether or not a person will subscribe to a new marketing initiative?

While the visual of the decision tree is helpful, tree-based models also provide feature importances, which represent how much each feature (column/variable) impacts the prediction.  Feature importances are values between 0 and 1, with higher values representing features that are more valuable in making predictions.  The feature importance does NOT determine directionality of the prediction.

In [None]:
# Interpreting how the model is making predictions
# Decision trees calculate the importance of features 
# (doesn't speak to directionality)
importance = pd.DataFrame(zip(X_train.columns, clf.feature_importances_), 
                          columns=['Feature', 'Feature Importance'])
importance.sort_values(by='Feature Importance', ascending=False)[:10]
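A useful property to keep in mind when reading this table is that the importances of a fitted tree add up to 1, so each value can be read as a share of the total. A minimal sketch on synthetic data (the `toy_` names and feature labels are assumptions for illustration):

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import make_classification

toy_X, toy_y = make_classification(n_samples=300, n_features=5,
                                   n_informative=2, random_state=0)
toy_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=42).fit(toy_X, toy_y)

toy_importance = pd.Series(toy_clf.feature_importances_,
                           index=['f0', 'f1', 'f2', 'f3', 'f4'])
print(toy_importance.sort_values(ascending=False))
print('Total:', toy_importance.sum())
```

Features the tree never splits on get an importance of exactly 0.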

### Evaluating Our Model

Finally, we'll pass our test set values that we set aside through our model to make predictions using the `.predict()` function.  Then you can check to see the inputs and how they impacted the predictions.


In [None]:
# Running our test set through our model to make predictions
y_pred = clf_train.predict(X_test)
# Combining our prediction with the base data
pred_data = X_test.reset_index().join(pd.DataFrame(zip(y_pred, y_test), 
                                                   columns=['Predicted Y', 'Actual Y']))
# Only display the features used to make predictions for first 10 predictions
pred_data[['Predicted Y', 'Actual Y', 
           'duration', 'poutcome_success', 'contact_unknown',
           'marital_married', 'age', 'balance', 'education_tertiary', 
           'previous', 'marital_single','day']][:10]

Based on the base data, do the predictions match with your decision tree?

> **You be the Data Scientist:** What insights can you share about which customers are good targets for new marketing initiatives?

Those 10 predictions look good, but how did our model do in predicting the rest of the test set? First, let's take a look at the accuracy of the model using the built-in `.score()` function, which looks at the proportion of predictions that were correct out of all of the predictions.  A score closer to 1 is better.

In [None]:
# Calculate the accuracy of the model
accuracy = clf_train.score(X_test, y_test)
print('Accuracy of the test set: ', accuracy)
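Accuracy is simply the fraction of predictions that match the actual values. The sketch below computes it by hand on ten made-up labels and compares it with a baseline that always predicts "no". The arrays are illustrative assumptions, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Ten made-up labels: 8 "no" and 2 "yes"
toy_actual = np.array(['no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'no', 'no'])
toy_predicted = np.array(['no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no'])

manual_accuracy = (toy_actual == toy_predicted).mean()       # matches / total
sklearn_accuracy = accuracy_score(toy_actual, toy_predicted)
baseline_accuracy = (toy_actual == 'no').mean()              # always predicting "no"

print(manual_accuracy, sklearn_accuracy, baseline_accuracy)
```

When one class dominates, the always-"no" baseline already scores high, which is worth remembering when judging an accuracy number on its own.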

That's a pretty good score for an initial model.  Since this is a classification model, let's take a look at how well the model predicted each class (yes/no) using a [confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/).  

In [None]:
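# Create confusion matrix
# (ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf_train, X_test, y_test, normalize='all', values_format='.2%', cmap='Blues');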

We can see that the model did a good job of correctly predicting when someone would not be a good candidate for the new marketing initiative, but not as good a job of predicting when they would be. This can happen when the training data is unbalanced and contains many more examples of one class than the other.

This is also why it's always good to use multiple ways to assess a model to understand the bigger picture of how well your model is making predictions.
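As one example of a second view on the same predictions, the sketch below builds a confusion matrix from made-up labels and computes recall for each class, that is, the share of actual "no" and actual "yes" cases the predictions got right. The label lists are hypothetical and chosen so that accuracy looks good while the "yes" class is poorly served.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 8 actual "no", 2 actual "yes"; one "yes" is missed
toy_actual = ['no'] * 8 + ['yes'] * 2
toy_predicted = ['no'] * 8 + ['yes', 'no']

# Rows are actual classes, columns are predicted classes, in the order given by labels
toy_cm = confusion_matrix(toy_actual, toy_predicted, labels=['no', 'yes'])
print(toy_cm)

toy_recall_no = recall_score(toy_actual, toy_predicted, pos_label='no')
toy_recall_yes = recall_score(toy_actual, toy_predicted, pos_label='yes')
print('Recall for "no": ', toy_recall_no)
print('Recall for "yes":', toy_recall_yes)
```

Here 9 of 10 predictions are correct, yet only half of the actual "yes" cases are found.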

## Take Home Challenges
For added practice and to improve your model:
- Think about what other factors would help make better predictions about if someone will be a good candidate for a new marketing initiative. Engineer those in your DataFrame in the **Process the Data** section, then run the rest of your code to see how your decision tree changes.  
- Change the parameters when you instantiate the model to see how the decision tree changes and whether it changes any of your predictions.
- Add and run other classifiers here.
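For the last challenge, one possible starting point is to compare a few scikit-learn classifiers with `cross_val_score`, which is imported at the top of the notebook but not used elsewhere. The sketch below is only an illustration under stated assumptions: it runs on synthetic, unbalanced data from `make_classification`, and the choice of classifiers and their settings is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 85% / 15% classes
toy_X, toy_y = make_classification(n_samples=500, n_features=10,
                                   weights=[0.85, 0.15], random_state=0)

candidates = {
    'decision tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'logistic regression': LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 cross-validation folds for each candidate
cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cross_val_score(model, toy_X, toy_y, cv=5).mean()
    print(f'{name}: {cv_scores[name]:.3f}')
```

To try this on the workshop data, the synthetic arrays would be swapped for the encoded features and target from the **Process the Data** section.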


