#

### Goal: The goal of this case study is to understand how using online banking services is related to customer profits.

##### 1. Understand the data: Load the data into a pandas DataFrame and explore it using the following functions:
        a. df.head(), df.info(), df.describe() to get a sense of the data, its structure, and basic statistics.

In [1]:
# Importing required Libraries

import pandas as pd

# loading dataset

pilgrim_raw=pd.read_csv("PilgrimData.csv")

In [None]:
# checking if data loaded correctly and displaying first few rows
pilgrim_raw.head()

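#### Before going further, I rename the columns to make better sense of the data
        Columns with the prefix 9 that have a year-2000 counterpart get a 1999 suffix: 9Profit becomes AnnualProfit(1999), 9Online becomes OnlineUsage(1999), and 9Billpay becomes BillPay(1999).
        Columns with the prefix 0 are assumed to be for year 2000: 0Profit becomes AnnualProfit(2000), and so on.
        Columns with the prefix 9 that occur only once are renamed without a year: 9Age becomes AgeBucket, 9Inc becomes IncomeBucket, 9Tenure becomes Tenure, and 9District becomes Region.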

In [3]:
pilgrim=pilgrim_raw.copy()

In [4]:
# Renaming columns
pilgrim=pilgrim.rename(columns={'9Profit':'AnnualProfit(1999)','9Online':'OnlineUsage(1999)', '9Age':'AgeBucket', 
                                '9Inc':'IncomeBucket', '9Tenure':'Tenure', '9District':'Region','0Profit':'AnnualProfit(2000)',
                                '0Online':'OnlineUsage(2000)','9Billpay':'BillPay(1999)','0Billpay':'BillPay(2000)'})

In [None]:
#exploring dataframe
pilgrim.info()

I observe that AgeBucket and IncomeBucket have missing values, which needs further investigation. AnnualProfit(2000), OnlineUsage(2000) and BillPay(2000) also have fewer non-null values; I assume the data for year 2000 was still being collected, so only limited data is available. This needs further investigation before drawing conclusions.

I want to check whether missing age and missing income are related, and how they impact profits.
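
The check described above could be sketched as follows. This is a minimal illustration on made-up data, not the case data: `toy` and the helper `missingness_summary` are hypothetical names, and only the column names are taken from the renamed dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the renamed `pilgrim` dataframe.
toy = pd.DataFrame({
    "AgeBucket":          [1, 2, np.nan, np.nan, 3, np.nan],
    "IncomeBucket":       [4, np.nan, np.nan, np.nan, 6, 2],
    "AnnualProfit(1999)": [100.0, -20.0, 10.0, 30.0, 250.0, 50.0],
})

def missingness_summary(df):
    """Cross-tabulate missing age vs. missing income and compare mean profit."""
    flags = pd.DataFrame({
        "AgeMissing": df["AgeBucket"].isna(),
        "IncomeMissing": df["IncomeBucket"].isna(),
    })
    # How often are the two columns missing together?
    overlap = pd.crosstab(flags["AgeMissing"], flags["IncomeMissing"])
    # Mean 1999 profit for customers with and without a recorded age.
    mean_profit = df.groupby(flags["AgeMissing"])["AnnualProfit(1999)"].mean()
    return overlap, mean_profit

overlap, mean_profit = missingness_summary(toy)
print(overlap)
print(mean_profit)
```

On the real data, the same function could be called as `missingness_summary(pilgrim)`. A large count in the cell where both flags are True would suggest that the two columns tend to be missing for the same customers.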

In [None]:
# exploring statistics
pilgrim.describe(include='all')

**I observe that the average customer profitability for the year 1999 was $111.50, which matches the case description.**

#### 2. Check the data format: Use functions like df.dtypes to understand the format and types of each column in your data and identify categorical data
        a. Convert '9District' to a categorical format if needed
        b. Check whether these age and income columns are treated as numerical or categorical

In [None]:
# checking data types of each column
pilgrim.dtypes

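We observe that Region (9District) has an integer data type in the dataframe.
However, region is a categorical variable that is merely represented by numbers,
and it doesn't make sense to regress on integers that are essentially category labels. Hence, I convert it to a categorical variable using one-hot encoding.

Similarly, AgeBucket and IncomeBucket are also converted to categorical variables.

Why is one-hot encoding required?
A regression on the raw integer codes would treat them as quantities, implying an order and equal spacing between categories. One-hot encoding gives each category its own indicator column, so each category gets its own coefficient.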
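
To make the encoding step concrete, here is a small sketch on made-up data. The dataframe `toy` and its region codes are hypothetical; only the `Region` and `AnnualProfit(1999)` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: Region is stored as integer codes.
toy = pd.DataFrame({
    "Region": [1100, 1200, 1300, 1200],
    "AnnualProfit(1999)": [50.0, -10.0, 120.0, 30.0],
})

# Treat the codes as labels rather than as quantities.
toy["Region"] = toy["Region"].astype("category")

# One indicator column per region. drop_first=True drops the first level
# (1100 here), which becomes the baseline the other regions are compared to.
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

With three regions, only two indicator columns remain. A row with both indicators equal to 0 belongs to the dropped baseline region.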


In [8]:

# Creating a new dataframe so that the initial, unmodified dataframe stays available later on for other questions

pilgrim_solution=pilgrim.copy()


In [9]:
# Dropping the ID column as it doesn't make sense in a regression

pilgrim_solution = pilgrim_solution.drop(['ID'], axis = 1)

# Converting to categorical (one-hot encoded) columns
pilgrim_solution = pd.get_dummies(pilgrim_solution, columns=['AgeBucket', 'IncomeBucket', 'Region'], drop_first=True)

# drop_first=True drops the first column of each category to avoid the dummy variable trap

#### 3. Visualize variables:
        a. Use histogram to visualize customer profits.
        b. Scatter plots between variables can give an idea of degree of relationships between dependent and independent variables and degree of collinearity between variables

In [None]:
import matplotlib.pyplot as plt

profits = pilgrim_solution['AnnualProfit(1999)'] 

# Plotting the histogram with a specified bin size 
plt.figure(figsize=(10,6))
plt.hist(profits, bins=50, color='blue', alpha=0.7, edgecolor='black')
plt.title('Distribution of Annual Profits for year 1999')
plt.xlabel('Annual Profit')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


The distribution is skewed, with the highest frequency near 0, which suggests that most customers generate little profit or a loss.
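
The visual impression from the histogram can be backed up with a few summary numbers. The sketch below uses made-up values in a hypothetical `profits` series; on the real data it would be `pilgrim_solution['AnnualProfit(1999)']`.

```python
import pandas as pd

# Hypothetical toy profits standing in for the 1999 profit column.
profits = pd.Series([-50.0, -20.0, 0.0, 5.0, 10.0, 15.0, 40.0, 900.0])

summary = {
    "mean": profits.mean(),
    "median": profits.median(),
    # Sample skewness: a positive value indicates a long right tail.
    "skewness": profits.skew(),
    # Fraction of customers who earn the bank nothing or lose it money.
    "share_non_positive": (profits <= 0).mean(),
}
print(summary)
```

A mean well above the median, together with positive skewness, would indicate that a small number of highly profitable customers pull the average up.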

In [None]:
import matplotlib.pyplot as plt

# Dropping customers without a recorded year-2000 profit before plotting
profits = pilgrim_solution['AnnualProfit(2000)'].dropna()

# Plotting the histogram with a specified bin size 
plt.figure(figsize=(10,6))
plt.hist(profits, bins=50, color='blue', alpha=0.7, edgecolor='black')
plt.title('Distribution of Annual Profits for year 2000')
plt.xlabel('Annual Profit')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


Profits appear to have gone up compared to 1999: even though about 16% of the year-2000 data is missing, profit is positive and the losses have at least been reduced. Hence, a regression for year 2000 is also necessary.
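
Because the two histograms cover different sets of customers, a like-for-like comparison is safer than comparing the plots by eye. A minimal sketch on made-up data follows; `toy`, `both_years` and the other variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one customer has no year-2000 profit recorded.
toy = pd.DataFrame({
    "AnnualProfit(1999)": [100.0, -30.0, 20.0, 60.0, -10.0],
    "AnnualProfit(2000)": [120.0, np.nan, 40.0, 50.0, 10.0],
})

# Share of customers with no year-2000 profit.
missing_share = toy["AnnualProfit(2000)"].isna().mean()

# Keep only customers observed in both years.
both_years = toy.dropna(subset=["AnnualProfit(1999)", "AnnualProfit(2000)"])

mean_by_year = both_years[["AnnualProfit(1999)", "AnnualProfit(2000)"]].mean()
mean_change = (both_years["AnnualProfit(2000)"] - both_years["AnnualProfit(1999)"]).mean()

print(f"missing in 2000: {missing_share:.0%}")
print(mean_by_year)
print(f"mean change per customer: {mean_change:.2f}")
```

If the customers missing in 2000 were mostly unprofitable in 1999, part of any apparent improvement would come from who dropped out rather than from a real change in profitability.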

In [None]:
# Scatterplot of online usage and profits

import matplotlib.pyplot as plt
# Scatter plot with annual profit on the y-axis and online usage on the x-axis
plt.scatter(pilgrim_solution['OnlineUsage(1999)'], pilgrim_solution['AnnualProfit(1999)'])
plt.xlabel('Online Usage for Year 1999')
plt.ylabel('Annual Profits for Year 1999')
plt.title('Scatter Plot of Online Usage vs. Annual Profits for Year 1999')
plt.show()

In [None]:
# Scatterplot of online usage and profits

import matplotlib.pyplot as plt
# Scatter plot with annual profit on the y-axis and online usage on the x-axis
plt.scatter(pilgrim_solution['OnlineUsage(2000)'], pilgrim_solution['AnnualProfit(2000)'])
plt.xlabel('Online Usage for Year 2000')
plt.ylabel('Annual Profits for Year 2000')
plt.title('Scatter Plot of Online Usage vs. Annual Profits for Year 2000')
plt.show()

**Observation:** Compared to 1999, either the profit-making online customers have switched, or their data is missing, or their number has reduced.
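
A grouped summary can make this comparison easier to read than a scatter plot. The sketch below assumes that online usage is a 0/1 indicator, which this notebook has not confirmed, and it uses made-up values in a hypothetical `toy` dataframe.

```python
import pandas as pd

# Hypothetical toy data; OnlineUsage is assumed to be a 0/1 flag.
toy = pd.DataFrame({
    "OnlineUsage(1999)":  [0, 0, 1, 1, 0, 1],
    "AnnualProfit(1999)": [10.0, 30.0, 50.0, 70.0, 20.0, 90.0],
})

# Number of customers and typical profit for each usage group.
by_usage = (
    toy.groupby("OnlineUsage(1999)")["AnnualProfit(1999)"]
       .agg(["count", "mean", "median"])
)
print(by_usage)
```

Running the same summary on the 1999 and 2000 columns of `pilgrim_solution` would show whether the number of online customers or their typical profit changed between the two years.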

To get a better understanding of the correlation between the variables/columns in the dataset, I will draw a pair plot and also plot the correlation matrix.
I will first do this only for the continuous variables, as a correlation coefficient is most meaningful for those.

A more detailed data analysis was done in Tableau and can be referred to there.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


# Using Seaborn to create pair plots between all columns
sns.pairplot(pilgrim)

# Show the plot
plt.show()


In [15]:
# Separating out the continuous columns for the first correlation matrix
cont_columns=['AnnualProfit(1999)', 'OnlineUsage(1999)', 'Tenure', 'AnnualProfit(2000)', 'OnlineUsage(2000)']

Understanding the relation between variables

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
correlation_matrix = pilgrim_solution[cont_columns].corr()

# Display the correlation matrix
# print(correlation_matrix)

# Plot the correlation matrix
plt.figure(figsize=(4, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()


In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
correlation_matrix = pilgrim_solution.corr()

# Display the correlation matrix
# print(correlation_matrix)

# Plot the correlation matrix
plt.figure(figsize=(20, 20))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
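
A heatmap over all encoded columns can be hard to read, so it may help to rank the columns by their correlation with the target. This is a sketch on made-up data: `toy`, `target` and `corr_with_target` are hypothetical names, and the column names only mirror those used above.

```python
import pandas as pd

# Hypothetical toy data with one target and three predictors.
toy = pd.DataFrame({
    "AnnualProfit(1999)": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Tenure":             [1.0, 2.0, 3.0, 4.0, 5.0],
    "OnlineUsage(1999)":  [1, 0, 1, 0, 1],
    "Region_1200":        [1, 1, 0, 0, 0],
})

target = "AnnualProfit(1999)"

# Correlation of every column with the target, strongest first by absolute value.
corr_with_target = (
    toy.corr()[target]
       .drop(target)
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

On the real data, replacing `toy` with `pilgrim_solution` would give the same ranking for all encoded columns. Note that these are pairwise correlations only and do not control for the other variables the way a regression does.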
