# Hackathon: Supervised Learning

## Background

You are part of a data analytics team at a bank or insurance company. A business unit has reached out, asking for help with a business problem. They want your team to analyze company data and develop a classification model to support data-driven decisions. You'll get a problem statement, a CSV dataset, and a data dictionary to work with.

## Output

Each team will create a Jupyter notebook that includes its code, outputs (such as computed values, tables, and graphics), and markdown cells that clearly explain each step and finding. Make sure the notebook is easy to read and understand, even for team members who aren't familiar with Python or data science. Each team will present its solution by walking through the final notebook in no more than 15 minutes, followed by up to 5 minutes of Q&A.

## Step-by-step Guide

1. **Data Import and Initial Inspection:**
   - Start by reviewing the business problem statement to understand what is at stake for the business, then read the data dictionary.
   - Import your data into a Pandas DataFrame and take a quick look at its structure using Pandas functions.
   - Transform the business problem into a data science problem by answering these questions (state any assumptions if you lack info):
     - **Target Variable:** What do the two target classes (the 0s and 1s, once encoded) mean? What's the business value of predicting the target accurately?
     - **Error Impact:** Which is costlier for the business – a false positive or a false negative? This helps in choosing the right scoring metric for your model.

2. **Exploratory Data Analysis (EDA):**
   - Dive into your data using univariate and multivariate stats and visualization techniques.
   - The goal of EDA is to understand your data and prep for feature engineering, not to draw business insights.

3. **Data Validation:**
   - Check if the data aligns with business logic.
   - Develop a strategy to handle missing data – either by removing or imputing it.

4. **Feature Transformation:**
   - Transform features as needed and explain why.
   - Consider scaling, encoding, dimensionality reduction, etc.

5. **Model Selection and Optimization:**
   - Test a few candidate algorithms.
   - Optimize the classification threshold if possible and explain your choice.
   - Tune hyperparameters to boost the performance of your top model.
   - Validate the final model by scoring it on the test set.

6. **Select the Best Model:**
   - Choose the model with the highest test set score.

7. **Feature Importance:**
   - Rank the features by their importance and share your insights. Use your best-performing model or any of the candidates for this.

8. **Presentation:**
   - Prepare a presentation (up to 15 minutes) to share your findings from steps 1–7. Use your Jupyter notebook as a visual aid.
   - Ensure your final notebook is clean, with logically ordered code and text cells. Include clear markdown write-ups for your key findings, making it a mini-report.
   - Use bullet points for clarity.
   - Pre-run your notebook to generate the outputs you’ll discuss – avoid running code live to sidestep the “demo effect.”

9. **General Guidelines:**
   - There are no wrong answers, and no specific expectations for metrics like accuracy, precision, or recall.
   - Keep it clear and concise: your notebook should explain your insights and process without being verbose.
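
Step 5 of the guide can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data is synthetic (`make_classification`), the two candidate models are arbitrary picks, and the error costs `COST_FP` and `COST_FN` are hypothetical placeholders for your answer to the Error Impact question in step 1. The sketch compares candidates with cross-validation on the training split and picks the threshold from out-of-fold predictions, so the test set is scored only once at the end; that is a design choice of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for a prepared feature matrix and 0/1 target.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Hold out a test set that is scored only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Compare candidate algorithms with 5-fold cross-validation on the training data.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name]

# Hypothetical business costs: here a missed positive is assumed to be
# five times as expensive as a false alarm.
COST_FP, COST_FN = 1.0, 5.0


def total_cost(y_true, proba, threshold):
    """Total assumed cost of the errors made at a given probability threshold."""
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn


# Choose the threshold on out-of-fold probabilities from the training data.
oof_proba = cross_val_predict(
    best_model, X_train, y_train, cv=5, method="predict_proba"
)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(y_train, oof_proba, t) for t in thresholds]
best_threshold = float(thresholds[int(np.argmin(costs))])

# Final check: refit on all training data and score the untouched test set.
best_model.fit(X_train, y_train)
test_proba = best_model.predict_proba(X_test)[:, 1]
test_cost = total_cost(y_test, test_proba, best_threshold)

print(f"CV ROC AUC per candidate: {cv_scores}")
print(f"Selected model: {best_name}, threshold: {best_threshold:.2f}")
print(f"Total assumed cost on the test set: {test_cost:.1f}")
```

Hyperparameter tuning (for example with scikit-learn's `GridSearchCV`) would slot in between the candidate comparison and the threshold search; it is left out here to keep the sketch short.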

## Business problems and data dictionaries

The following pages contain the business problem statements and data dictionaries for Teams 1, 2, 3, 4, and 5. The raw data will be distributed separately.

```python
import pandas as pd
```

### 1. Term Deposit Marketing Campaign

What types of customers do we need to focus on to ensure the highest success rate?

Logan Bank is running a direct marketing campaign where they call customers to see if they're interested in a Term Deposit. The marketing department needs your help to pinpoint the high-potential customers who are most likely to buy. You've got a dataset of past customer calls, including their details and how they responded to the marketing calls. Also, this isn't Logan Bank's first marketing campaign, so the dataset includes info on customers' responses to previous campaigns. The data set includes the following data columns:

| Column      | Description                                                           |
|-------------|-----------------------------------------------------------------------|
| customerid  | Customer ID                                                           |
| age         | Customer age                                                          |
| salary      | Customer’s salary                                                     |
| balance     | Customer’s account balance                                            |
| marital     | Customer’s marital status                                             |
| education   | Customer’s education                                                  |
| job         | Customer’s job category                                               |
| targeted    | Has the customer been targeted (i.e., contacted) previously?          |
| default     | Does the customer have credit in default?                             |
| housing     | Does the customer have a housing loan?                                |
| loan        | Does the customer have a personal loan?                               |
| contact     | Contact communication type ('cellular' or 'telephone')                |
| day         | Day of the week of the most recent contact                            |
| month       | Month of the most recent contact                                      |
| duration    | Last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then response='no'). |
| campaign    | Number of contacts performed during this campaign for this client     |
| response (target) | Has the client subscribed to a term deposit?                      |

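```python
pd.read_csv('../data/1_bank_marketing_modified.csv').head()
```

| Unnamed: 0 | customerid | age | salary | balance | marital | education | job | targeted | default | housing | loan | contact | day | month | duration | campaign | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 58 | 100000 | 2143 | married | tertiary | management | yes | no | yes | no | unknown | 5 | 5 | 4.35 | 1 | no |
| 1 | 2 | 44 | 60000 | 29 | single | secondary | technician | yes | no | yes | no | unknown | 5 | 5 | 2.52 | 1 | no |
| 2 | 3 | 33 | 120000 | 2 | married | secondary | entrepreneur | yes | no | yes | yes | unknown | 5 | 5 | 1.27 | 1 | no |
| 3 | 4 | 47 | 20000 | 1506 | married | unknown | blue-collar | no | no | yes | no | unknown | 5 | 5 | 1.53 | 1 | no |
| 4 | 5 | 33 | 0 | 1 | single | unknown | unknown | no | no | no | no | unknown | 5 | 5 | 3.3 | 1 | no |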
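
As a possible starting point for step 1 on this dataset, the sketch below loads the five rows shown above (inlined so it runs without the CSV file), encodes the target, and measures how often the placeholder `unknown` appears. The names `target` and `cat_cols` are illustrative. Dropping `duration` is a judgment call, not a requirement: the data dictionary notes that it strongly drives the outcome, and a call's duration is only known after the call has been made.

```python
import io

import pandas as pd

# The five rows displayed above, inlined so the sketch runs without the CSV file.
sample_csv = io.StringIO("""\
,customerid,age,salary,balance,marital,education,job,targeted,default,housing,loan,contact,day,month,duration,campaign,response
0,1,58,100000,2143,married,tertiary,management,yes,no,yes,no,unknown,5,5,4.35,1,no
1,2,44,60000,29,single,secondary,technician,yes,no,yes,no,unknown,5,5,2.52,1,no
2,3,33,120000,2,married,secondary,entrepreneur,yes,no,yes,yes,unknown,5,5,1.27,1,no
3,4,47,20000,1506,married,unknown,blue-collar,no,no,yes,no,unknown,5,5,1.53,1,no
4,5,33,0,1,single,unknown,unknown,no,no,no,no,unknown,5,5,3.3,1,no
""")

# index_col=0 reads the unnamed first column as the index
# instead of keeping it as a stray "Unnamed: 0" feature.
df = pd.read_csv(sample_csv, index_col=0)
df.info()

# Encode the target: 1 = subscribed to a term deposit, 0 = did not.
df["target"] = df["response"].map({"no": 0, "yes": 1})

# "unknown" is a placeholder rather than a real category; measure its share.
cat_cols = ["marital", "education", "job", "contact"]
unknown_share = df[cat_cols].eq("unknown").mean()
print(unknown_share)

# Separate features from the target. The identifier carries no signal, and
# duration is excluded here because it is unknown before a call is placed.
X = df.drop(columns=["customerid", "response", "target", "duration"])
y = df["target"]
```

On the full dataset, `y.value_counts(normalize=True)` would show the class balance, which feeds into the choice of scoring metric.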


### 2. Credit Card Upselling

Does a bank customer intend to sign up for a new credit card?

Fairweather Bank offers its customers a range of products, including checking and savings accounts, investment options, and credit products. They also cross-sell these products to their current customers. Right now, Fairweather is running a marketing campaign to get existing customers to sign up for a new credit card. They've identified a group of customers who are eligible and are using various communication methods, like telecasting, emails, and mobile banking ads, to promote the new card. Fairweather needs your help to identify which customers are most likely to sign up for the new credit card. You have been provided a dataset which includes the following data columns:

| Column             | Description                                      |
|--------------------|--------------------------------------------------|
| ID                 | Unique Identifier for a row                      |
| Gender             | Gender of the Customer                           |
| Age                | Age of the Customer (in Years)                   |
| Region_Code        | Code of the Region for the customers             |
| Occupation         | Occupation Type for the customer                 |
| Channel_Code       | Acquisition Channel Code for the Customer (Encoded) |
| Vintage            | Vintage for the Customer (In Months)             |
| Credit_Product     | If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.) |
| Avg_Account_Balance | Average Account Balance for the Customer in the last 12 Months |
| Is_Active          | If the Customer is Active in the last 3 Months   |
| Is_Lead (target)   | If the Customer is interested in the Credit Card [0 : Customer is not interested, 1 : Customer is interested] |

```python
pd.read_csv('../data/2_Credit_Card _Lead_Prediction_modified.csv').head()
```

| Unnamed: 0 | ID | Gender | Age | Region_Code | Occupation | Channel_Code | Vintage | Credit_Product | Avg_Account_Balance | Is_Active | Is_Lead |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NNVBBKZB | Female | 73 | RG268 | Other | X3 | 43 | No | 1045696 | No | No |
| 1 | IDD62UNG | Female | 30 | RG277 | Salaried | X1 | 32 | No | 581988 | No | No |
| 2 | HD3DSEMC | Female | 56 | RG268 | Self_Employed | X3 | 26 | No | 1484315 | Yes | No |
| 3 | BF3NC7KV | Male | 34 | RG270 | Salaried | X1 | 19 | No | 470454 | No | No |
| 4 | TEASRWXV | Female | 30 | RG282 | Salaried | X1 | 33 | No | 886787 | No | No |
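
For step 4 on this dataset, one way to combine scaling and encoding is a scikit-learn `ColumnTransformer`. The sketch below is an illustration on the five rows shown above, not a recommended feature set: the split into `numeric_cols` and `categorical_cols` is an assumption, and one-hot encoding `Region_Code` may be impractical if the full data contains many distinct regions.

```python
import io

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The five rows displayed above, inlined so the sketch runs without the CSV file.
sample_csv = io.StringIO("""\
,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
0,NNVBBKZB,Female,73,RG268,Other,X3,43,No,1045696,No,No
1,IDD62UNG,Female,30,RG277,Salaried,X1,32,No,581988,No,No
2,HD3DSEMC,Female,56,RG268,Self_Employed,X3,26,No,1484315,Yes,No
3,BF3NC7KV,Male,34,RG270,Salaried,X1,19,No,470454,No,No
4,TEASRWXV,Female,30,RG282,Salaried,X1,33,No,886787,No,No
""")
df = pd.read_csv(sample_csv, index_col=0)

# Assumed split of the feature columns by type; ID is left out on purpose.
numeric_cols = ["Age", "Vintage", "Avg_Account_Balance"]
categorical_cols = [
    "Gender", "Region_Code", "Occupation",
    "Channel_Code", "Credit_Product", "Is_Active",
]

preprocessor = ColumnTransformer(
    transformers=[
        # Put numeric features on a comparable scale (zero mean, unit variance).
        ("num", StandardScaler(), numeric_cols),
        # One indicator column per category; categories unseen during fit
        # are encoded as all zeros instead of raising an error.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = df[numeric_cols + categorical_cols]
y = df["Is_Lead"].map({"No": 0, "Yes": 1})

X_encoded = preprocessor.fit_transform(X)
print(X_encoded.shape)
```

In a real workflow the transformer would be fitted on the training split only, for instance by placing it in a `Pipeline` ahead of the classifier, so that no information from the test set leaks into the scaling or the category list.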


### 3. Customer Churn

How do we retain existing customers?

A manager at Hubbard Bank knows that bringing in new clients is much costlier than retaining existing ones. It's crucial for Hubbard to understand why clients might decide to leave and seek services elsewhere. By figuring out which customers are likely to leave, Hubbard can develop effective loyalty programs and retention strategies to keep as many customers as possible. You've been asked to help Hubbard identify the customers who are likely to leave the bank. For this purpose, you've been given a historical dataset that includes the following columns:

| Column                  | Description                                           |
|-------------------------|-------------------------------------------------------|
| CLIENTNUM               | Unique identifier for the customer holding the account |
| Attrition_Flag (target) | 1 = account closed, 0 = account remains active        |
| Customer_Age            | Customer's age in years                               |
| Gender                  | Gender                                                |
| Dependent_count         | Number of dependents                                  |
| Education_Level         | Educational Qualification of the account holder       |
| Marital_Status          | Married, Single, Unknown                              |
| Income_Category         | Annual income category of the account holder          |
| Card_Category           | Type of Card                                          |
| Months_on_book          | Months on book                                        |
| Total_Relationship_Count| Total number of products held by the customer         |
| Months_Inactive_12_mon  | Number of months inactive over the last 12 months     |
| Contacts_Count_12_mon   | Number of Contacts in the last 12 months              |
| Credit_Limit            | Limit on Credit Card                                  |
| Total_Revolving_Bal     | Total Revolving Balance on the Credit Card            |
| Avg_Open_To_Buy         | Open to Buy Credit Line (Average of last 12 months)   |
| Total_Amt_Chng_Q4_Q1    | Change in transaction amounts Q4 over Q1              |
| Total_Trans_Amt         | Total Transaction Amount (Last 12 months)             |
| Total_Trans_Ct          | Total Transaction Count (Last 12 months)              |
| Total_Ct_Chng_Q4_Q1     | Change in transaction count Q4 over Q1                |
| Avg_Utilization_Ratio   | Average Card Utilization Ratio                        |

```python
pd.read_csv('../data/3_Bank_Churners.csv').head()
```

| Unnamed: 0 | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | \$60K - \$80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than \$40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | \$80K - \$120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.0 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than \$40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.76 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | \$60K - \$80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.5 | 0.0 |
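
The sample above lends itself to two checks from steps 1 and 3: encoding the text-valued target, and testing whether some columns are arithmetically derived from others. The sketch below uses only the relevant columns from the five rows shown. The two relationships it tests are hypotheses that happen to hold in this sample; whether they hold across the full dataset is for you to verify. The name `target` is illustrative.

```python
import numpy as np
import pandas as pd

# Relevant columns from the five rows displayed above.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 5,
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
    "Total_Revolving_Bal": [777, 864, 0, 2517, 0],
    "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
    "Avg_Utilization_Ratio": [0.061, 0.105, 0.0, 0.76, 0.0],
})

# Encode the target as 1 = account closed, 0 = account remains active.
# Any label other than "Existing Customer" is treated as a closed account.
df["target"] = (df["Attrition_Flag"] != "Existing Customer").astype(int)

# Hypothesis 1: open-to-buy equals credit limit minus revolving balance.
open_to_buy_ok = np.isclose(
    df["Credit_Limit"] - df["Total_Revolving_Bal"], df["Avg_Open_To_Buy"]
)

# Hypothesis 2: utilization equals revolving balance divided by credit limit
# (the ratio is stored rounded to three decimals, hence the tolerance).
utilization_ok = np.isclose(
    df["Total_Revolving_Bal"] / df["Credit_Limit"],
    df["Avg_Utilization_Ratio"],
    atol=0.001,
)

print(f"Open-to-buy consistent in {open_to_buy_ok.mean():.0%} of rows")
print(f"Utilization consistent in {utilization_ok.mean():.0%} of rows")
```

If the relationships hold on the full data, the derived columns add no new information, which is worth considering when you decide what to keep in step 4.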


### 4. Credit Risk

#### Minimize loan defaults

Alverstone Bank is looking for data-driven ways to cut down on loan defaults. They need your expertise to flag high-risk borrowers. To assist you, they've given you a historical dataset where each row represents a past loan. The dataset includes the following columns:

| Column                      | Description               |
|-----------------------------|---------------------------|
| person_age                  | Age                       |
| person_income               | Annual Income             |
| person_home_ownership       | Home ownership            |
| person_emp_length           | Employment length (in years) |
| loan_intent                 | Loan intent               |
| loan_grade                  | Loan grade                |
| loan_amnt                   | Loan amount               |
| loan_int_rate               | Interest rate             |
| loan_status (target)        | Loan status (0 is non-default, 1 is default) |
| loan_percent_income         | Percent income            |
| cb_person_default_on_file   | Historical default        |
| cb_person_cred_hist_length  | Credit history length     |

```python
pd.read_csv('../data/4_Credit_risk_dataset_modified.csv').head()
```

| Unnamed: 0 | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123 | PERSONAL | D | 35000 | 16.02 | Default | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5 | EDUCATION | B | 1000 | 11.14 | OK | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1 | MEDICAL | C | 5500 | 12.87 | Default | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4 | MEDICAL | C | 35000 | 15.23 | Default | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8 | MEDICAL | C | 35000 | 14.27 | Default | 0.55 | Y | 4 |
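
The first row of the sample above already contains a value that fails a business-logic check (step 3): an employment length of 123 years for a 22-year-old. The sketch below shows one way to flag such values and impute them. The rule (employment length greater than age) and the choice of median imputation are assumptions made for illustration; removing the affected rows is an equally defensible strategy.

```python
import pandas as pd

# Relevant columns from the five rows displayed above.
df = pd.DataFrame({
    "person_age": [22, 21, 25, 23, 24],
    "person_emp_length": [123, 5, 1, 4, 8],
    "loan_status": ["Default", "OK", "Default", "Default", "Default"],
})

# Business-logic check: nobody can have been employed for longer than
# they have been alive.
implausible = df["person_emp_length"] > df["person_age"]
print(f"Implausible employment lengths: {int(implausible.sum())}")

# Treat implausible values as missing, then impute with the median of
# the remaining values.
df["person_emp_length"] = df["person_emp_length"].astype(float).mask(implausible)
median_emp_length = df["person_emp_length"].median()
df["person_emp_length"] = df["person_emp_length"].fillna(median_emp_length)

# Encode the target to match the data dictionary: 0 = non-default, 1 = default.
df["target"] = df["loan_status"].map({"OK": 0, "Default": 1})

print(df)
```

When you split the data for modeling, compute the imputation value on the training split only and reuse it for the test split.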


### 5. Health Insurance Cross Sell

#### How do we increase car insurance sales?

Augusta Insurance Company, which provides health insurance to its customers, needs your help to determine if last year's policyholders might be interested in vehicle insurance. Using the dataset that includes demographics (gender, age, region code type), vehicle details (vehicle age, damage history), and current policy info (premium, sourcing channel), please assist Augusta in identifying potential customers for their car insurance product.

| Column                  | Definition                                                      |
|-------------------------|-----------------------------------------------------------------|
| id                      | Unique ID for the customer                                      |
| Gender                  | Gender of the customer                                          |
| Age                     | Age of the customer                                             |
| Driving_License         | Does the customer have a driver’s license                       |
| Region_Code             | Unique code for the region of the customer                      |
| Previously_Insured      | Does the customer already have vehicle insurance                |
| Vehicle_Age             | Age of the vehicle                                              |
| Vehicle_Damage          | Has the customer’s vehicle been damaged in the past             |
| Annual_Premium          | The amount the customer needs to pay as an annual premium       |
| Policy_Sales_Channel    | Anonymized code for the channel of outreaching to the customer (e.g., different agents, over mail, over phone, in person, etc.) |
| Vintage                 | Number of days the customer has been associated with the company|
| Response (target)       | Is the customer interested in the insurance product             |

```python
pd.read_csv('../data/5_Health_insurance_cross_sell_modified.csv').head()
```

| Unnamed: 0 | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 44 | Yes | 28 | No | > 2 Years | Yes | 40454 | 26 | 217 | Yes |
| 1 | 2 | Male | 76 | Yes | 3 | No | 1-2 Year | No | 33536 | 26 | 183 | No |
| 2 | 3 | Male | 47 | Yes | 28 | No | > 2 Years | Yes | 38294 | 26 | 27 | Yes |
| 3 | 4 | Male | 21 | Yes | 11 | Yes | < 1 Year | No | 28619 | 152 | 203 | No |
| 4 | 5 | Female | 29 | Yes | 41 | Yes | < 1 Year | No | 27496 | 152 | 39 | No |
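
Several columns in the sample above are text-valued but carry a natural order or a yes/no meaning, which matters for steps 2 and 4. The sketch below encodes them with explicit mappings and then computes a response rate by group. The mapping `vehicle_age_order` assumes the three labels visible in the sample are the only ones in the data, and five rows are far too few to draw any conclusion from the resulting rates.

```python
import pandas as pd

# Relevant columns from the five rows displayed above.
df = pd.DataFrame({
    "Driving_License": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "Previously_Insured": ["No", "No", "No", "Yes", "Yes"],
    "Vehicle_Age": ["> 2 Years", "1-2 Year", "> 2 Years", "< 1 Year", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes", "No", "No"],
    "Response": ["Yes", "No", "Yes", "No", "No"],
})

# Vehicle_Age is ordinal, so an ordered integer code preserves its meaning.
# Assumes these three labels are the only ones present.
vehicle_age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
df["Vehicle_Age"] = df["Vehicle_Age"].map(vehicle_age_order)

# Yes/No columns, including the target, become 1/0.
yes_no_cols = ["Driving_License", "Previously_Insured", "Vehicle_Damage", "Response"]
for col in yes_no_cols:
    df[col] = df[col].map({"No": 0, "Yes": 1})

# map() yields NaN for any label missing from the mapping, so this check
# catches unexpected categories in the full dataset.
assert df.notna().all().all(), "unmapped category found"

# Simple bivariate view: share of interested customers per group.
rate_by_insured = df.groupby("Previously_Insured")["Response"].mean()
print(rate_by_insured)
```

The same `groupby(...).mean()` pattern works for any categorical feature once the target is encoded as 0/1.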
