## Financial Inclusion in Africa

In this project, we'll work with the **'Financial Inclusion in Africa'** dataset, which was provided as part of the Financial Inclusion in Africa challenge hosted on the Zindi platform.

Dataset description: The dataset contains demographic information and the financial services used by approximately 33,600 individuals across East Africa. The ML model's role is to predict which individuals are most likely to have or use a bank account.

The term financial inclusion means that individuals and businesses have access to useful and affordable financial products and services that meet their needs (transactions, payments, savings, credit and insurance), delivered in a responsible and sustainable way.


https://i.imgur.com/UNUZ4zR.jpg



**Instructions**

1. Install the necessary packages
2. Import your data and perform a basic data exploration phase
- Display general information about the dataset
- Create a pandas profiling report to gain insights into the dataset
- Handle missing and corrupted values
- Remove duplicates, if they exist
- Handle outliers, if they exist
- Encode categorical features
3. Based on the previous data exploration, train and test a machine learning classifier
4. Create a Streamlit application (locally) and add input fields for your features and a validation button at the end of the form
5. Import your ML model into the Streamlit application and start making predictions from the provided feature values
6. Deploy your application on Streamlit Share:
- Create GitHub and Streamlit Share accounts
- Create a new git repo
- Upload your local code to the newly created git repo
- Log in to your Streamlit account and deploy your application from the git repo

#### Importing necessary libraries

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#### Loading the dataset

In [None]:
df = pd.read_csv("Financial_inclusion_dataset.csv")

#### Overview of the dataset

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

##### Summary statistics

In [None]:
df.describe()

##### Checking for missing values and duplicates

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

No missing values and no duplicates. Cool!
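
The instructions also ask for an outlier check. One common approach is the 1.5 × IQR rule; the sketch below applies it to a made-up `age` column (the values are illustrative and not taken from the dataset), so treat it as a starting point rather than the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame({"age": [18, 22, 25, 30, 31, 35, 40, 42, 95]})

# Interquartile range and the usual 1.5 * IQR fences
q1 = toy["age"].quantile(0.25)
q3 = toy["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows that fall outside the fences are flagged as potential outliers
is_outlier = (toy["age"] < lower) | (toy["age"] > upper)
outliers = toy[is_outlier]
print(outliers)
```

Whether flagged rows should be dropped, capped, or kept depends on the column; a very old respondent may be a valid observation.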

### Encoding

#### Binary Variables

In [None]:
df['bank_account'].value_counts()

In [None]:
df['location_type'].value_counts()

In [None]:
df['cellphone_access'].value_counts()

In [None]:
df['gender_of_respondent'].value_counts()

#### Binary Encoding of Binary Variables

In [None]:
df['bank_account'] = df['bank_account'].map({'Yes': 1, 'No': 0})
df['cellphone_access'] = df['cellphone_access'].map({'Yes': 1, 'No': 0})
df['gender_of_respondent'] = df['gender_of_respondent'].map({'Female': 0, 'Male': 1})
df['location_type'] = df['location_type'].map({'Rural': 0, 'Urban': 1})
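
One thing to watch with `.map`: any value missing from the mapping dictionary silently becomes `NaN`. The sketch below shows this on a toy Series (the lowercase `'yes'` is a deliberately planted typo) and how counting nulls after mapping catches it.

```python
import pandas as pd

# Toy column with one label that does not match the mapping keys
toy = pd.Series(["Yes", "No", "yes", "No"])

encoded = toy.map({"Yes": 1, "No": 0})

# Values absent from the dictionary are mapped to NaN
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

Running `df[col].isna().sum()` on each mapped column right after the encoding cell is a cheap sanity check.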

#### Multi-category Variables

In [None]:
df['marital_status'].value_counts()

In [None]:
df['education_level'].value_counts()

In [None]:
df['job_type'].value_counts()

#### One-Hot Encoding for multi-category variables

In [None]:
df = pd.get_dummies(df, columns=['job_type'], drop_first=True)

In [None]:
df.head()
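
To see what `drop_first=True` does, here is a small sketch on toy data. The alphabetically first category loses its dummy column and becomes the implicit baseline (all remaining dummies equal to 0).

```python
import pandas as pd

# Toy frame with three job types
toy = pd.DataFrame(
    {"job_type": ["Self employed", "No Income", "Farming and Fishing", "No Income"]}
)

dummies = pd.get_dummies(toy, columns=["job_type"], drop_first=True)

# 'Farming and Fishing' sorts first, so it has no column of its own
print(list(dummies.columns))
```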

#### Feature Engineering

In [None]:
df.columns

In [None]:
df['has_income'] = df[['job_type_Farming and Fishing', 'job_type_Formally employed Government', 'job_type_Formally employed Private', 
                       'job_type_Informally employed', 'job_type_Other Income', 'job_type_Self employed'  ]].sum(axis=1)

In [None]:
df['is_married'] = df['marital_status'].apply(lambda x: 1 if x == 'Married/Living together' else 0)
df['is_single'] = df['marital_status'].apply(lambda x: 1 if x == 'Single/Never Married' else 0)

In [None]:
# Creating binary columns for each education level
df['primary_education'] = df['education_level'].apply(lambda x: 1 if x == 'Primary education' else 0)
df['no_education'] = df['education_level'].apply(lambda x: 1 if x == 'No formal education' else 0)
df['secondary_education'] = df['education_level'].apply(lambda x: 1 if x == 'Secondary education' else 0)
df['tertiary_education'] = df['education_level'].apply(lambda x: 1 if x == 'Tertiary education' else 0)
df['vocational_training'] = df['education_level'].apply(lambda x: 1 if x == 'Vocational/Specialised training' else 0)
df['other_education'] = df['education_level'].apply(lambda x: 1 if x == 'Other/Dont know/RTA' else 0)
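
The same flags can also be built without `apply`, by comparing the column to the label and casting the boolean result to integers. A minimal sketch on toy data, showing that both forms give identical values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"education_level": ["Primary education", "Tertiary education", "Primary education"]}
)

# Row-by-row version, as used above
via_apply = toy["education_level"].apply(lambda x: 1 if x == "Primary education" else 0)

# Vectorized version: elementwise comparison, then cast True/False to 1/0
via_compare = (toy["education_level"] == "Primary education").astype(int)

same = via_apply.tolist() == via_compare.tolist()
print(same)
```

The vectorized form is usually faster on large frames, though at this dataset's size either is fine.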

#### Correlation matrix to check linear relationship

In [None]:
# Selecting only numeric columns for correlation
numeric_df = df.select_dtypes(include=[float, int])
correlation_matrix = numeric_df.corr()

In [None]:
correlation_matrix
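
Since the goal is to predict `bank_account`, the most useful slice of the matrix is the target's column. The sketch below ranks features by the absolute value of their correlation with the target, using toy data with hypothetical values.

```python
import pandas as pd

# Toy numeric frame (values are illustrative only)
toy = pd.DataFrame({
    "bank_account":     [1, 0, 1, 0, 1, 0],
    "cellphone_access": [1, 0, 1, 0, 1, 1],
    "location_type":    [0, 1, 0, 1, 1, 0],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["bank_account"]
    .drop("bank_account")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Applied to the notebook's data, the equivalent would be `correlation_matrix['bank_account']`.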

#### Variance Threshold to eliminate low variance features

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Separate the features (X) and the target variable (y)
X = df.drop(columns=['country', 'uniqueid', 'gender_of_respondent', 'relationship_with_head', 'marital_status', 'education_level', 'bank_account'])
y = df['bank_account']

# Create the VarianceThreshold object with a specified threshold
selector = VarianceThreshold(threshold=0.1)

# Fit the model on the feature data
X_var_thresh = selector.fit_transform(X)

# Check which features remain
remaining_features = X.columns[selector.get_support()]
print(remaining_features)
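
A note on what the threshold means here. For a 0/1 column where a share p of the rows is 1, the variance is p(1 − p), so `threshold=0.1` removes binary columns whose share of ones is below roughly 11% or above roughly 89%. Unscaled numeric columns such as an age almost always pass, because their variance is much larger. The sketch below illustrates this on synthetic columns (all values made up).

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Rare binary flag: 5% ones -> variance 0.05 * 0.95 = 0.0475
rare = np.zeros(100)
rare[:5] = 1
# Balanced binary flag: 50% ones -> variance 0.25
balanced = np.tile([0, 1], 50)
# Age-like column on its original scale -> large variance
age = rng.integers(16, 90, size=100)

X_toy = np.column_stack([rare, balanced, age])

selector_toy = VarianceThreshold(threshold=0.1).fit(X_toy)
kept = selector_toy.get_support()
print(kept)
```

A low-variance flag is not necessarily uninformative, so the output is best read as a hint rather than a final selection.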


### Modelling

#### Logistic Regression with L1 (Lasso: Least Absolute Shrinkage and Selection Operator) Regularization as an Embedded Method

Used here to select the best features.

In [None]:
from sklearn.linear_model import LogisticRegression

Selecting my features and splitting the data into training and test sets

In [None]:
X = df.drop(columns=['year', 'country', 'uniqueid', 'gender_of_respondent', 'relationship_with_head', 'marital_status', 
                    'education_level', 'vocational_training', 'bank_account', 'job_type_Remittance Dependent',])

y = df['bank_account']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2, random_state =42)
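
If the target classes turn out to be imbalanced (check `df['bank_account'].value_counts()`), passing `stratify=y` keeps the class proportions the same in both splits. This is an optional variation, not what the cell above does; the sketch uses synthetic labels with 20% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20 positives and 80 negatives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the 20% positive rate
print(y_tr.mean(), y_te.mean())
```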

In [None]:
print(X_train.columns)
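
The printed column list is also the layout that any later prediction input (for example, from a Streamlit form) has to reproduce: same names, same order. One way to do that is `DataFrame.reindex`, sketched below with a hypothetical, shortened column list. For a scikit-learn model fitted on a DataFrame, the fitted estimator's `feature_names_in_` attribute holds the real list.

```python
import pandas as pd

# Hypothetical training columns (shortened for illustration)
training_columns = [
    "location_type", "cellphone_access",
    "job_type_Self employed", "job_type_No Income",
]

# A single raw input row that only sets the columns it knows about
raw_row = {"location_type": 1, "cellphone_access": 1, "job_type_Self employed": 1}

# Reorder to the training layout; columns missing from the row are filled with 0
input_df = pd.DataFrame([raw_row]).reindex(columns=training_columns, fill_value=0)
print(input_df)
```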

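This time we use logistic regression with `penalty='l1'`. The L1 (lasso) penalty adds the sum of the absolute coefficient values to the loss, which shrinks the coefficients of weak features to exactly zero and so acts as built-in feature selection. `C` is the inverse of the regularization strength: smaller values mean a stronger penalty.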
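
The sparsity effect of the L1 penalty is easiest to see on synthetic data. The sketch below uses `make_classification` output rather than this dataset, and counts how many coefficients are exactly zero as `C` grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 10 features, only 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

n_zero = {}
for C in (0.001, 0.01, 1.0):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", random_state=0)
    clf.fit(X_toy, y_toy)
    # Count coefficients that the penalty has driven to exactly zero
    n_zero[C] = int(np.sum(clf.coef_ == 0))
    print(f"C={C}: {n_zero[C]} of {clf.coef_.size} coefficients are exactly zero")
```

Features whose coefficients stay non-zero under a fairly strong penalty are the ones the model relies on most.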

In [None]:
model= LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# scikit-learn metrics expect (y_true, y_pred), in that order
ACC = accuracy_score(y_test, y_pred)
ACC

In [None]:
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat

In [None]:
class_report = classification_report(y_test, y_pred)
print(class_report)
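
Argument order matters for these metrics: scikit-learn's signature is `(y_true, y_pred)`. Accuracy is symmetric, but swapping the arguments transposes the confusion matrix and exchanges precision and recall in the report. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat  = [1, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# With the arguments swapped, the matrix comes out transposed
cm_swapped = confusion_matrix(y_hat, y_true)

print(cm)
print(cm_swapped)
```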

In [None]:
model.coef_

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define the figure and axis
fig = plt.figure()
ax = plt.subplot(111)

# 28 color definitions
colors = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 
          'black', 'pink', 'lightgreen', 'lightblue', 'gray', 
          'indigo', 'orange', 'salmon', 'purple', 'gold', 
          'silver', 'brown', 'violet', 'lime', 'teal', 
          'navy', 'maroon', 'olive', 'coral', 'chocolate', 
          'crimson', 'darkblue']

weights, params = [], []

# Loop through regularization strengths
for c in np.arange(-4., 6.):
    model2 = LogisticRegression(penalty='l1', C=10.**c, solver='liblinear', random_state=42)
    model2.fit(X_train, y_train)
    weights.append(model2.coef_)
    params.append(10**c)

weights = np.array(weights)

# Plot each column's weights using the color list
for column, color in zip(range(weights.shape[2]), colors):  # Use shape[2] for correct column size
    plt.plot(params, weights[:, 0, column],  # Access weights by [:, 0, column] for 2D plot
             label=X.columns[column],  # Ensure X.columns has the right size
             color=color)

# Add horizontal line at y=0
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('Weight coefficient')
plt.xlabel('C (inverse regularization strength)')
plt.xscale('log')
# Place the legend outside the plot area, to the right
ax.legend(loc='upper center', bbox_to_anchor=(1.38, 1.03), ncol=1, fancybox=True)

# Save the figure
plt.savefig('lasso-path.pdf', dpi=300, bbox_inches='tight', pad_inches=0.2)

# Show the plot
plt.show()


#### Saving my model

In [None]:
import joblib

In [None]:
joblib.dump(model, 'financialmodel.pkl')
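
Before relying on the saved file, it can be worth confirming that a model reloaded with `joblib.load` predicts the same as the one in memory. The sketch below does this round trip with a toy model and a temporary file; the file name `toy_model.pkl` is a placeholder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model fitted on a tiny, separable problem
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_toy, y_toy)

# Save to a temporary location, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should give identical predictions
same_predictions = bool((clf.predict(X_toy) == restored.predict(X_toy)).all())
print(same_predictions)
```

The same scikit-learn version should be used for saving and loading, since pickled estimators are not guaranteed to be compatible across versions.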

### Create The Streamlit App

In [None]:
# Create the file Financial_Inclusion_Prediction_Streamlit_App.py in write mode
with open("Financial_Inclusion_Prediction_Streamlit_App.py", "w") as file:
    # Writing the Streamlit code into the file
    file.write("""
# Import necessary libraries
import streamlit as st
import pandas as pd
import joblib

# Load the model trained and saved earlier in this notebook
model = joblib.load("financialmodel.pkl")

# Set up the Streamlit app
st.title('Financial Inclusion in Africa: Bank Account Prediction')
st.write("This app predicts how likely an individual is to have or use a bank account.")

# Category labels. The job types in OTHER_JOBS (apart from 'Remittance Dependent') are
# assumptions; check them against df['job_type'].value_counts().
INCOME_JOBS = ['Farming and Fishing', 'Formally employed Government', 'Formally employed Private',
               'Informally employed', 'Other Income', 'Self employed']
OTHER_JOBS = ['Remittance Dependent', 'Government Dependent', 'No Income', 'Dont Know/Refuse to answer']
EDUCATION_COLUMNS = {
    'Primary education': 'primary_education',
    'No formal education': 'no_education',
    'Secondary education': 'secondary_education',
    'Tertiary education': 'tertiary_education',
    'Vocational/Specialised training': 'vocational_training',
    'Other/Dont know/RTA': 'other_education',
}

# Input fields for the raw feature values (the numeric ranges are assumptions)
location_type = st.selectbox('Location type', ['Rural', 'Urban'])
cellphone_access = st.selectbox('Cellphone access', ['Yes', 'No'])
household_size = st.number_input('Household size', min_value=1, max_value=30, value=4, step=1)
age_of_respondent = st.number_input('Age of respondent', min_value=16, max_value=100, value=35, step=1)
marital_status = st.selectbox('Marital status', ['Married/Living together', 'Single/Never Married', 'Other'])
education_level = st.selectbox('Education level', list(EDUCATION_COLUMNS))
job_type = st.selectbox('Job type', INCOME_JOBS + OTHER_JOBS)

# Encode the inputs the same way as in the notebook
# ('household_size' and 'age_of_respondent' are assumed to be the dataset's column names)
input_data = {
    'location_type': 1 if location_type == 'Urban' else 0,
    'cellphone_access': 1 if cellphone_access == 'Yes' else 0,
    'household_size': household_size,
    'age_of_respondent': age_of_respondent,
    'job_type_' + job_type: 1,
    'has_income': 1 if job_type in INCOME_JOBS else 0,
    'is_married': 1 if marital_status == 'Married/Living together' else 0,
    'is_single': 1 if marital_status == 'Single/Never Married' else 0,
    EDUCATION_COLUMNS[education_level]: 1,
}

# Build a one-row DataFrame with exactly the columns the model was trained on;
# columns not set above (the unselected dummies) are filled with 0
input_df = pd.DataFrame([input_data]).reindex(columns=model.feature_names_in_, fill_value=0)

# Predict the probability of having a bank account
if st.button('Predict'):
    prediction = model.predict_proba(input_df)[:, 1]  # Probability of class 1 (has a bank account)
    probability = round(prediction[0] * 100, 2)
    st.write(f"The predicted probability of having a bank account is {probability}%")

# Option to display input data
if st.checkbox('Show Input Data'):
    st.write(input_df)

""")