# **Problem Statement:-** 
### The Marketing department of Adventure Works Cycles wants to increase sales by targeting specific customers for a mailing campaign. The company's database contains a list of past customers and a list of potential new customers. By investigating the attributes of previous bike buyers, the company hopes to discover patterns that it can then apply to potential customers. It hopes to use the discovered patterns to predict which potential customers are most likely to purchase a bike from Adventure Works Cycles.

## Importing data 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
customers=pd.read_csv('/kaggle/input/microsoft-adventure-works-cycles-customer-data/AWCustomers.csv')
display(customers.head())
print("No. of customers:",len(customers))

## Cleaning and Preprocessing

In [None]:
customers.info()

In [None]:
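# Title, Suffix and AddressLine2 have a large number of null values and so contribute little to the data
# (AddressLine2 is still selected here and is dropped later with the other address columns)
# The name of a customer should not have an effect on whether they will purchase a bike or not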
important_cols=['AddressLine1','AddressLine2','City','StateProvinceName','CountryRegionName','PostalCode',
                'BirthDate','Education','Occupation','Gender','MaritalStatus','HomeOwnerFlag',
               'NumberCarsOwned','NumberChildrenAtHome','TotalChildren','YearlyIncome']

In [None]:
# Take an explicit copy so that later in-place edits do not act on a view of customers
df=customers[important_cols].copy()

In [None]:
df.head()

### Handling columns that tell us about geographical location of customer

In [None]:
print(
    "No. of unique addresses:",len(df['AddressLine1'].unique()),'\n',
    "No. of unique cities:",len(df['City'].unique()),'\n',
    "No. of unique postal codes:",len(df['PostalCode'].unique()),'\n',
    "No. of unique states:",len(df['StateProvinceName'].unique()),'\n',
    "No. of unique countries:",len(df['CountryRegionName'].unique())
)

Only the country column is kept, as there are too many distinct addresses, cities, postal codes and states to encode.

In [None]:
def preprocess_address(df):
    # Drop the fine-grained location columns; only the country is kept
    df=df.drop(columns=['AddressLine1','AddressLine2','City','StateProvinceName','PostalCode'])
    # One-hot encode the country; the first level is dropped and acts as the baseline
    ohe_countries=pd.get_dummies(df['CountryRegionName'],drop_first=True)
    df=df.drop(columns=['CountryRegionName'])
    return pd.concat([ohe_countries,df],axis=1)

In [None]:
df=preprocess_address(df)
df.head()
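
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. The three country values are illustrative only and are not read from the dataset. The first level in sorted order gets no column of its own and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy data; the real values come from AWCustomers.csv
toy = pd.DataFrame({'CountryRegionName': ['Canada', 'Australia', 'France', 'Canada']})

# drop_first=True removes the first category in sorted order ('Australia' here)
toy_ohe = pd.get_dummies(toy['CountryRegionName'], drop_first=True)
print(list(toy_ohe.columns))
print(toy_ohe)
```

A row whose indicator columns are all zero (or all False) therefore belongs to the dropped baseline country.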

Converting the BirthDate column to an age in years

In [None]:
df['BirthDate']= pd.to_datetime(df['BirthDate'])

import datetime
# The reference date is the moment the notebook runs, so ages change between runs
CURRENT_TIME = datetime.datetime.now()
def get_age(birth_date,today=CURRENT_TIME):
    # Approximate age in whole years: days elapsed divided by 365 (leap days are ignored)
    y=today-birth_date
    return y.days//365

df['Age']=df['BirthDate'].apply(get_age)

df=df.drop(['BirthDate'],axis=1)

df.head()
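
Dividing elapsed days by 365 ignores leap days, so it can be off by one year close to a birthday. The sketch below shows a calendar-based alternative against a fixed reference date. The function name `age_on`, the constant `REFERENCE_DATE` and the toy birth dates are assumptions made for illustration; the notebook itself uses the day-count approach.

```python
import pandas as pd

def age_on(birth_date, reference):
    """Whole years completed between birth_date and reference."""
    years = reference.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the reference year
    if (reference.month, reference.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# REFERENCE_DATE is an assumed, arbitrary cutoff chosen for illustration
REFERENCE_DATE = pd.Timestamp('2020-01-01')
births = pd.to_datetime(pd.Series(['1980-01-01', '1980-01-02', '1999-12-31']))

ages = births.apply(lambda b: age_on(b, REFERENCE_DATE))
# The day-count approximation, for comparison
approx = (REFERENCE_DATE - births).dt.days // 365

print(ages.tolist())
print(approx.tolist())
```

For the second toy birth date the day-count version reports one year more than the calendar version, because the accumulated leap days push the quotient over the boundary a day early. A fixed reference date also keeps the ages reproducible between runs.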

Education is an ordinal column, so we map it to integers in this order:
1. Partial High School
2. High School 
3. Partial College
4. Bachelors 
5. Graduate Degree

In [None]:
df['Education'].value_counts()

In [None]:
df['Education']=df['Education'].map({'Partial High School':1,'High School':2,'Partial College':3,'Bachelors':4,'Graduate Degree':5})
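
`Series.map` with a dict turns any value that is missing from the dict into NaN, so a misspelt or unexpected category would silently become a missing value. A minimal sketch of a check is shown below on made-up data. The constant name `EDUCATION_ORDER` is an assumption, and the 'Doctorate' entry is hypothetical and only there to trigger the check.

```python
import pandas as pd

EDUCATION_ORDER = {'Partial High School': 1, 'High School': 2, 'Partial College': 3,
                   'Bachelors': 4, 'Graduate Degree': 5}

# Toy input; 'Doctorate' is deliberately absent from the mapping
toy = pd.Series(['Bachelors', 'High School', 'Doctorate'])
mapped = toy.map(EDUCATION_ORDER)

# Values absent from the mapping come back as NaN
unmapped = toy[mapped.isna()].unique().tolist()
print(unmapped)
```

Running such a check on the original column, before it is overwritten, shows whether the mapping covers every category reported by `value_counts()`.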

Occupation is treated as an ordinal column, so we map it to integers in this order:
1. Manual 
2. Skilled Manual
3. Clerical
4. Management
5. Professional

In [None]:
df['Occupation']=df['Occupation'].map({'Manual':1,'Skilled Manual':2,'Clerical':3,'Management':4,'Professional':5})

In [None]:
df.head()

Handling binary categorical columns: Gender and Marital Status

In [None]:
def handle_cardinal_cols(df):
    df['Male']=df['Gender'].map({'M':1,'F':0})
    df.drop(['Gender'],axis=1,inplace=True)
    df['MaritalStatus']=df['MaritalStatus'].map({'M':1,'S':0})
    
    return df

In [None]:
df=handle_cardinal_cols(df)
df.head()

In [None]:
df.isnull().sum()

There is a large discrepancy between the value ranges of the columns, especially between YearlyIncome and the rest, so YearlyIncome and Age are normalized using MinMaxScaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler
def scaleDown(df):
    scaler=MinMaxScaler()
    scaled=scaler.fit_transform(df[['YearlyIncome','Age']])
    df['YearlyIncome_scaled']=scaled[:,0]
    df['Age_scaled']=scaled[:,1]
    df.drop(['YearlyIncome','Age'],axis=1,inplace=True)
    return df

In [None]:
df=scaleDown(df)

In [None]:
df.head()
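
Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). The sketch below applies that formula in plain pandas to made-up numbers. For columns that are not constant it should match what `MinMaxScaler` does with its default `feature_range`.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
toy = pd.DataFrame({'YearlyIncome': [20000, 50000, 110000], 'Age': [25, 40, 65]})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Because the minimum and maximum are taken from the data at hand, the same fitted values would have to be reused when scaling the list of potential new customers.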

### Observing the relation between Education and YearlyIncome

In [None]:
from scipy.spatial import distance

In [None]:
distance.cosine(df['Education'].values,df['YearlyIncome_scaled'].values)

In [None]:
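# Note: SciPy documents jaccard as a dissimilarity between boolean vectors,
# so its value on these numeric columns is hard to interpret
distance.jaccard(df['Education'].values,df['YearlyIncome_scaled'].values)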
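
The Jaccard distance is defined on sets, or equivalently on boolean vectors: among the positions where at least one vector is true, it is the share where only one of them is. The sketch below computes it with NumPy on two made-up binary columns. The helper name `jaccard_distance` and the toy values are assumptions for illustration.

```python
import numpy as np

def jaccard_distance(u, v):
    """Share of positions where exactly one vector is True, among positions where either is True."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    either = np.logical_or(u, v).sum()
    if either == 0:
        return 0.0  # convention assumed here for two all-False vectors
    return np.logical_xor(u, v).sum() / either

# Hypothetical binary flags for five customers
home_owner = [1, 1, 0, 0, 1]
married = [1, 0, 0, 1, 1]
print(jaccard_distance(home_owner, married))
```

This measure suits pairs of 0/1 columns such as HomeOwnerFlag and MaritalStatus better than an ordinal column paired with a continuous one.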

In [None]:
from scipy.stats import pearsonr
pearsonr(df['Education'].values,df['YearlyIncome_scaled'].values)[0]
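
The cosine distance and the Pearson correlation are closely related: the Pearson coefficient equals the cosine similarity of the two vectors after each has been mean-centered. Cosine distance on the raw columns therefore also reflects their means, which is why it can look very different from the correlation. The sketch below checks the identity with NumPy on made-up numbers; the helper `cosine_distance` is written out here for illustration.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical values standing in for Education and YearlyIncome_scaled
education = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
income = np.array([0.10, 0.25, 0.20, 0.60, 0.90])

pearson_r = np.corrcoef(education, income)[0, 1]
# Cosine similarity of the mean-centered vectors
centered_similarity = 1.0 - cosine_distance(education - education.mean(),
                                            income - income.mean())
print(pearson_r, centered_similarity)
```

The two printed numbers agree up to floating-point error, so of the three measures computed above, the Pearson coefficient is the one that directly describes a linear relationship between the columns.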