
## Project Title 
Sales Prediction for Existing Customers Using Logistic Regression

## Objective
This project uses logistic regression to predict whether an existing customer will make a purchase, with two demographic features, age and salary, as predictors.

## Machine Learning Workflow (Documentation)

1. Data Collection: Gather the dataset containing customer demographics and purchase status.
2. Data Exploration and Wrangling: Analyze and clean the data to ensure quality.
3. Data Preparation (Feature Engineering): Select and preprocess relevant features.
4. Model Building and Training: Develop and train a logistic regression model.
5. Model Evaluation: Assess model performance using accuracy and a confusion matrix.
6. Fine-tuning: Adjust the model to improve performance (if necessary).
7. Final Evaluation: Confirm the model's accuracy on unseen data.

In [20]:
# Import necessary libraries
import pandas as pd           # Data manipulation
import numpy as np            # Numerical operations
import matplotlib.pyplot as plt  # Data visualization

# Machine learning libraries
from sklearn.model_selection import train_test_split  # Data splitting
from sklearn.preprocessing import StandardScaler      # Feature scaling
from sklearn.linear_model import LogisticRegression   # Logistic regression model
from sklearn.metrics import confusion_matrix, accuracy_score  # Model evaluation metrics


In [21]:
# Load the dataset
# A raw string (r"...") already treats backslashes literally, so single backslashes are enough
df = pd.read_csv(r"C:\Users\hp\Documents\Data science Projects\Project 3\DigitalAd_dataset.csv")

# Display the first few rows of the dataset
df.head()


|   | Age | Salary | Status |
|---|-----|--------|--------|
| 0 | 18 | 82000 | 0 |
| 1 | 29 | 80000 | 0 |
| 2 | 47 | 25000 | 1 |
| 3 | 45 | 26000 | 1 |
| 4 | 46 | 28000 | 1 |


In [22]:
# Checking dataset information, including data types and non-null counts
df.info()

# Displaying the shape of the dataset (number of rows and columns)
print(df.shape)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Age     400 non-null    int64
 1   Salary  400 non-null    int64
 2   Status  400 non-null    int64
dtypes: int64(3)
memory usage: 9.5 KB
(400, 3)


In [23]:
# Check for missing values in each column
df.isnull().sum()


Age       0
Salary    0
Status    0
dtype: int64

In [24]:
# Generate descriptive statistics of the dataset (e.g., mean, standard deviation)
df.describe()


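|       | Age | Salary | Status |
|-------|-----|--------|--------|
| count | 400.0 | 400.0 | 400.0 |
| mean | 37.655 | 69742.5 | 0.3575 |
| std | 10.482877 | 34096.960282 | 0.479864 |
| min | 18.0 | 15000.0 | 0.0 |
| 25% | 29.75 | 43000.0 | 0.0 |
| 50% | 37.0 | 70000.0 | 0.0 |
| 75% | 46.0 | 88000.0 | 1.0 |
| max | 60.0 | 150000.0 | 1.0 |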
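
The `Status` mean of 0.3575 over 400 rows implies 143 buyers and 257 non-buyers, so the classes are moderately imbalanced. As a rough sketch, the snippet below rebuilds a label series with those counts (a reconstruction from the summary statistics, not the original column; `status_demo` is a hypothetical name) and computes the accuracy a model would get by always predicting the majority class. That figure is a useful reference point when reading any accuracy score later on.

```python
import pandas as pd

# Reconstructed labels: mean 0.3575 over 400 rows -> 143 ones and 257 zeros.
# This is a stand-in built from df.describe(), not the real 'Status' column.
status_demo = pd.Series([1] * 143 + [0] * 257, name="Status")

# Share of each class in the data
class_share = status_demo.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

With the real data loaded, the same check would be `df['Status'].value_counts(normalize=True)`.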


In [25]:
# Segregate the dataset into features (X) and target (y)
# Selecting 'Age' and 'Salary' as independent variables (X)
X = df[['Age', 'Salary']]  # Input features
# Selecting 'Status' as the target variable (y)
y = df['Status']


In [26]:
# Display the first few rows of the target variable
y.head()


0    0
1    0
2    1
3    1
4    1
Name: Status, dtype: int64

In [27]:
# Display the first few rows of the features
X.head()


|   | Age | Salary |
|---|-----|--------|
| 0 | 18 | 82000 |
| 1 | 29 | 80000 |
| 2 | 47 | 25000 |
| 3 | 45 | 26000 |
| 4 | 46 | 28000 |


In [28]:
# Optional: drop 'Age' and 'Salary' from df. X and y were already extracted above,
# so they are unaffected; after this step df contains only 'Status'.
df = df.drop(['Age', 'Salary'], axis=1)


In [29]:
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the StandardScaler for scaling features
scaler = StandardScaler()

# Fit and transform the training set; only transform the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
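
Fitting the scaler on the training set only, as above, keeps test-set statistics out of preprocessing. One possible way to keep that guarantee when cross-validating is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below is an alternative to the notebook's manual steps, not what the notebook does, and it runs on synthetic age/salary values with a made-up labelling rule (`X_demo`, `y_demo`, and the rule itself are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age/Salary features (not the real dataset)
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.integers(18, 61, size=400),            # age in years
    rng.integers(15_000, 150_001, size=400),   # salary
])
# Hypothetical labelling rule, used only to give the model something to learn
y_demo = ((X_demo[:, 0] > 40) | (X_demo[:, 1] > 100_000)).astype(int)

# Scaler and model travel together, so the scaler is re-fitted on the
# training portion of every cross-validation fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```

The printed score describes the synthetic data only and says nothing about the real dataset.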


In [30]:
# Feature scaling: put Age and Salary on the same scale. The target y is left untouched.
# Scaling lets all features contribute equally to the result, so the large-valued
# Salary does not dominate the small-valued Age.
# fit_transform: fit() computes the mean and variance of each feature in the training
# data, then transform() standardizes the features using those statistics.
# transform: applies the training-set mean and variance to other data (here, the test set).
#
# Equivalent to the scaling already applied in the previous cell, so it stays commented out:
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)
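
To make the fit/transform description concrete, here is a small sketch that standardizes the five `(Age, Salary)` rows shown by `df.head()` in two ways: with `StandardScaler`, and by hand as `(x - mean) / std`. The names `sample`, `scaler_demo`, and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five (Age, Salary) rows displayed by df.head()
sample = np.array([
    [18, 82000],
    [29, 80000],
    [47, 25000],
    [45, 26000],
    [46, 28000],
], dtype=float)

# fit() learns each column's mean and variance; transform() applies them
scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(sample)

# Same computation by hand. StandardScaler uses the population standard
# deviation (ddof=0), which is also NumPy's default for std().
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, which is why neither feature dominates purely because of its units.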

In [31]:
# Initialize the Logistic Regression model
logistic_model = LogisticRegression(random_state=42)

# Train the model using the training data
logistic_model.fit(X_train, y_train)
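
For a binary target, a fitted logistic regression turns a weighted sum of the inputs into a probability through the sigmoid function, and `predict` returns class 1 when that probability exceeds 0.5. The following sketch checks this on synthetic data (the data and the names `X_demo`, `y_demo`, and `model_demo` are assumptions for illustration, not the notebook's variables).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, standing in for the scaled Age/Salary columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

model_demo = LogisticRegression(random_state=42).fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps z to a probability
z = X_demo @ model_demo.coef_.ravel() + model_demo.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# scikit-learn's probability for the positive class
sk_proba = model_demo.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sk_proba))
```

On the notebook's model, `logistic_model.predict_proba(X_test)[:, 1]` would give the estimated purchase probability for each test customer rather than only the 0/1 label.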


In [39]:
# Predict the target values using the test set
y_pred = logistic_model.predict(X_test)

# Evaluate model performance using accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print accuracy score
print(f"Accuracy: {accuracy}")

# Print confusion matrix
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.825
Confusion Matrix:
 [[54  1]
 [13 12]]
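
Accuracy alone hides how the errors are distributed. As a short sketch, the confusion matrix printed above can be unpacked into precision, recall, and F1 for the positive (buyer) class. The variable names here are illustrative and chosen so they do not overwrite the notebook's `accuracy` and `conf_matrix`.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm_demo = np.array([[54, 1],
                    [13, 12]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy_demo = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
precision_demo = tp / (tp + fp)             # predicted buyers who actually bought
recall_demo = tp / (tp + fn)                # actual buyers the model found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"Accuracy:  {accuracy_demo:.3f}")
print(f"Precision: {precision_demo:.3f}")
print(f"Recall:    {recall_demo:.3f}")
print(f"F1 score:  {f1_demo:.3f}")
```

scikit-learn's `classification_report(y_test, y_pred)` reports the same quantities directly from the labels.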


In [40]:
age = int(input("Enter the age of the new customer: "))
salary = int(input("Enter the salary of the new customer: "))

# Build a one-row DataFrame with the same column names the scaler was fitted on
new_customer = pd.DataFrame([[age, salary]], columns=['Age', 'Salary'])

# Scale the input with the fitted scaler, then predict with the trained model
result = logistic_model.predict(scaler.transform(new_customer))
print(result)
if result[0] == 1:
    print("Customer will buy")
else:
    print("Customer won't buy")

Enter the age of the new customer:  220
Enter the salary of the new customer:  212


[1]
Customer will buy
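
The inputs in this run (age 220, salary 212) lie far outside the ranges seen in training (ages 18 to 60, salaries 15000 to 150000 according to `df.describe()`), so the prediction is an extrapolation and should not be trusted. A minimal sketch of a range check that could run before predicting is shown below; `TRAINING_RANGES` and `out_of_range_features` are hypothetical helpers, not part of the notebook.

```python
# Ranges taken from the min and max rows of df.describe()
TRAINING_RANGES = {"Age": (18, 60), "Salary": (15000, 150000)}


def out_of_range_features(age, salary):
    """Return the names of the inputs that fall outside the training range."""
    values = {"Age": age, "Salary": salary}
    flagged = []
    for name, value in values.items():
        low, high = TRAINING_RANGES[name]
        if not low <= value <= high:
            flagged.append(name)
    return flagged


# The inputs used in the run above
flagged = out_of_range_features(220, 212)
if flagged:
    print(f"Warning: outside training range: {', '.join(flagged)}")
```

A caller could refuse to predict, or show the prediction with a warning, whenever the returned list is not empty.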




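### Summary of the Trained Model
On the 80-customer test set, the model reaches 82.5% accuracy (66 of 80 predictions correct). For comparison, always predicting "no purchase" would already be right for 55 of the 80 test customers (68.75%).

The model is good at identifying customers who did not buy: 54 of 55 were classified correctly, with only 1 false positive.
The model struggles more with customers who did buy: it identified only 12 of the 25 actual buyers and missed the other 13 (false negatives), predicting that they would not buy. That is a recall of 48% for the positive class.

The overall accuracy is therefore driven mostly by the non-buyer class, and the model has clear room for improvement in identifying actual buyers.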
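
One common way to reduce false negatives, not something this notebook tried, is to lower the decision threshold below the default 0.5 so that more customers are predicted as buyers. The sketch below illustrates the idea on synthetic data with a made-up labelling rule; all names ending in `_demo` and the thresholds are assumptions for illustration, and the printed numbers do not describe the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(42)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
noise = rng.normal(scale=1.0, size=400)
# Hypothetical rule: older and better-paid customers buy more often
y_demo = ((age - 38) / 10 + (salary - 70_000) / 35_000 + noise > 0.7).astype(int)
X_demo = np.column_stack([age, salary])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
scaler_demo = StandardScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

model_demo = LogisticRegression(random_state=42).fit(X_tr, y_tr)
proba = model_demo.predict_proba(X_te)[:, 1]  # estimated probability of buying

# Compare recall at the default threshold and at a lower one
recalls = {}
for threshold in (0.5, 0.3):
    y_pred_demo = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_pred_demo)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases, but precision usually falls. The right trade-off depends on how costly a missed buyer is compared with a wasted contact.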
