## Seyedsaman Emami
### 02.Jul.2021

# Table of contents
* About the Notebook
* Hypothesis
* Importing Libraries
* Overview of dataset
    * Exploring the Data
    * Data sampling
    * Visualizing
    * Outlier treatment
    * Correlation
* Conclusion

# About this notebook
In this notebook, I reviewed the demographic dataset and investigated different aspects of this simple dataset.
In the first part, I imported the libraries I needed for my experiments.
To load the dataset, I used the pandas library and read the CSV file with `pd.read_csv`. For an overview of the data frame, I displayed its first five rows with the `head` method. For the description, I studied the five features of the dataset (min, max, standard deviation, mean, and quartiles). The size, types, and dimensions of the dataset are shown in the related cells.

To have a clean dataset, I checked for null or missing values; there are different approaches to dealing with them if any are found. I therefore counted the missing values and used another function to remove them, should there be any.

Next, I explored the data in more detail. I visualized it with a histogram of each feature, scatter plots to see the relationship between pairs of columns, and box plots that summarize the quartiles, maximum, minimum, median, and mean of each attribute.

Regarding outliers, I reviewed the box plot and scatter plot and defined lower and upper bounds to drop the outliers.

Finally, I checked the correlation with the Pearson method and printed the covariance.

# Hypothesis

* There would be a relationship between different features of the dataset.
* There is a high correlation between one pair of columns.
* There are duplicated values.
* Do customers in different regions spend more per transaction? Which regions spend the most/least?
* Is there a relationship between number of items purchased and amount spent?

# Importing Libraries

In [None]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

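For sampling and other experiments, I drew a random seed to pass as the `random_state` of functions that accept one. Because the seed itself is drawn at random, the sample changes from run to run; the printed value can be reused to reproduce a given run.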

In [None]:
Random_seed = random.randint(1, 1000)
print(Random_seed)
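
As a minimal sketch of why the seed matters, the cell below uses a made-up toy frame (`toy` and `FIXED_SEED` are illustrative names, not part of the notebook) to show that `DataFrame.sample` returns the same rows whenever it receives the same `random_state`:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({"age": range(20, 40), "amount": range(100, 120)})

FIXED_SEED = 42  # hypothetical fixed value; any integer works

# The same seed always selects the same rows.
first = toy.sample(frac=0.25, random_state=FIXED_SEED)
second = toy.sample(frac=0.25, random_state=FIXED_SEED)

print(first.equals(second))
print(len(first))
```

Replacing `random.randint(1, 1000)` with a fixed integer would make every run of the notebook draw the same sample, at the cost of always exploring the same subset.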

# Obtaining Data

In [None]:
data = pd.read_csv('/kaggle/input/demographic-data/Demographic_Data.csv')

## An overview of the dataset

In [None]:
data.head()

### Check the data description

In [None]:
data.describe()

### Check the data types

In [None]:
data.info()

## Data cleaning

### Check missing values

In [None]:
data.isnull().sum()

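### Drop missing values and duplicate rows
> We do not have any missing values

In [None]:
# Drop rows with missing values, then fully duplicated rows
data = data.dropna().drop_duplicates()
data.isnull().sum()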
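
Since the real dataset has no missing values, the effect of these calls is not visible in the notebook. The following sketch runs them on a made-up frame (`toy` and its values are assumptions for illustration only) that contains one missing value and one exact duplicate:

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing value and one exact duplicate.
toy = pd.DataFrame({
    "age": [25, 31, np.nan, 31],
    "amount": [120.0, 80.5, 42.0, 80.5],
})

missing_per_column = toy.isnull().sum()        # NaN count per column
no_missing = toy.dropna()                      # drops the row holding NaN
no_duplicates = no_missing.drop_duplicates()   # drops the repeated row

print(missing_per_column.to_dict())
print(no_missing.shape, no_duplicates.shape)
```

`dropna` and `drop_duplicates` address different problems, so checking one does not replace the other.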

### Check our data type

In [None]:
data.dtypes

## Explore the Data

In [None]:
print(data.columns)

### The number of purchases

I grouped the dataset by individual columns to compare each feature against the others.

In [None]:
data.groupby('in-store')['in-store'].count()

Here we can find the mean of `in-store` for each region. If `in-store` is coded as 0/1, this is the share of in-store purchases per region.

In [None]:
pd.DataFrame(data.groupby(data['region'])['in-store'].mean())
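
The same grouping pattern can address the hypothesis about spending per region. The sketch below is only an illustration on invented numbers (`toy` is a hypothetical frame, and its values are not results from the dataset); it ranks regions by their mean `amount`:

```python
import pandas as pd

# Invented purchases; these numbers are NOT taken from the dataset.
toy = pd.DataFrame({
    "region": [1, 1, 2, 2, 3, 4],
    "amount": [900.0, 700.0, 200.0, 300.0, 500.0, 1000.0],
})

# Mean amount per region, highest spender first.
mean_amount = (
    toy.groupby("region")["amount"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_amount)
```

On the real data, replacing `toy` with `data` would give the ranking of regions by average spend per transaction.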

Many Python libraries emit warnings while they run. To keep the output readable, I suppress them.

In [None]:
import warnings
warnings.simplefilter("ignore")
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x='in-store', data=data, kind='count', aspect=1)
plt.title("Number of in-store purchases")

The same plot after dropping rows whose `amount` value repeats

In [None]:
subset = data.drop_duplicates(subset='amount')
sns.catplot(x='in-store', data=subset, kind='count', aspect=1)
plt.title("Number of in-store purchases (unique amounts)")

Check the description of only two features

In [None]:
data[['amount','items']].describe()

## Dropping duplicate values by subsetting the amount
The previous plots did not make much sense to me, so I looked for duplicated values in a specific column.

First, I sorted the data frame by `items` to get the table layout I was looking for.

I stored the modified dataset in a *Sorted* DataFrame.

In [None]:
Sorted = data.sort_values('items', inplace=False)
Sorted.drop_duplicates(subset='amount', inplace=True)
Sorted.tail()
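
It helps to see exactly which rows `drop_duplicates(subset='amount')` removes. In this sketch on made-up rows (`toy` is hypothetical), three different purchases share the same amount, and only the first one in sort order survives:

```python
import pandas as pd

# Made-up purchases: three rows share amount 50.0 but differ otherwise.
toy = pd.DataFrame({
    "items": [3, 1, 2, 5],
    "amount": [50.0, 50.0, 75.0, 50.0],
    "region": [1, 2, 3, 4],
})

by_items = toy.sort_values("items")
# keep="first" is the default: the first row per amount is retained.
deduped = by_items.drop_duplicates(subset="amount")
print(deduped)
```

Rows that merely share an amount are treated as duplicates even when their other columns differ, so the sort order decides which purchase is kept. Whether that is the desired behavior depends on the question being asked.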

## Visualizing the data 

In [None]:
# Rows per distinct `items` value (kept in `counts` so `Sorted` is not overwritten)
counts = data.groupby(data['items']).count()
it = counts.iloc[:, 0:3]
plt.hist(counts)
plt.legend(counts.columns)
plt.title('Counting items')
print(counts)
print(it)

In [None]:
Sorted = data.sort_values('items', inplace=False)
Sorted.drop_duplicates(subset='amount', inplace=True)

age = Sorted.groupby(Sorted['age']).count()
plt.hist(age)
plt.legend(age.columns)
plt.title('Counting records per age')
plt.xlabel('Records per age value')
plt.ylabel('Number of age values')

### Check the columns' names

### Plotting
>  a histogram on the features

In [None]:
plt.figure(figsize=(10, 10))
for i, j in enumerate(data.columns):
    plt.subplot(3, 2, i+1)
    plt.hist(Sorted[j], color='teal', histtype='bar')
    plt.title(j)
    plt.ylabel("Frequency")

### Data sampling

I drew a random sample from the sorted, de-duplicated dataset to explore the data at a smaller scale.

In [None]:
sample = Sorted.sample(frac=0.005, random_state=Random_seed)
print(sample.shape)
sample.tail()

In [None]:
X = sample['age']
y = sample['amount']
print("Age sample:", X, '\n',
      "Amount sample:", y)

### Comparing 
> Let's compare the simple samples of two different features of our dataset

In [None]:
plt.scatter(X, y, marker='o', c='lawngreen')
plt.title("Scatter plot")
plt.xlabel("age")
plt.ylabel("amount")
plt.show()

# Scatter

This scatter plot is built the same way as the previous one, except that the plotting code is wrapped in a Python function so it can be reused for other pairs of columns.

In [None]:
X = sample['region']
y = sample['amount']

def sct(X, y, Xlabel=None, ylabel=None):
    """Draw a scatter plot of y against X with optional axis labels."""
    plt.scatter(X, y, marker='o', c='lawngreen')
    plt.title("Scatter plot")
    plt.xlabel(Xlabel)
    plt.ylabel(ylabel)
    plt.show()

sct(X, y, 'region', 'amount')

In [None]:
X = sample['items']
y = sample['amount']

# Reuse the sct helper defined in the previous cell
sct(X, y, 'items', 'amount')

### Feature identification

Instead of printing plots one-by-one, we can print them out in one loop in the range of features of the dataset.

In [None]:
plt.figure(figsize=(10, 15))
for i, j in enumerate(data.columns):
    plt.subplot(3, 2, i+1)
    # Green diamonds mark the fliers; the mean is drawn as a line
    plt.boxplot(Sorted[j], notch=False, sym='gD', showmeans=True,
                meanline=True, autorange=True)
    plt.title('Box plot - ' + j)

# Outlier treatment

As the box plot of `amount` shows, this feature contains outlier values.

After seeing the box plot, I was curious about the `amount` feature, so I took a subsample of it and compared it with its median.

In [None]:
outliers = np.where(Sorted['amount'] > 2000)
print('outlier indices:', outliers)
# Number of rows whose amount exceeds 2000
(Sorted['amount'] > 2000).sum()

In [None]:
sct(Sorted['region'], Sorted['amount'], 'region', 'amount')

In [None]:
amount = sample['amount'].values
# Horizontal reference line at the median of the sampled amounts
m = np.full(amount.shape[0], np.median(amount))
plt.plot(amount, label='Amount of purchase')
plt.plot(m, linewidth=3, color='r', label='Median')
plt.legend()
plt.title("Checking amount outliers")

In [None]:
z = np.abs(stats.zscore(amount))
plt.plot(z, c='g', alpha=0.3)
plt.ylabel('Distance')
plt.xlabel('index')
plt.title('Distance of the amount value from the mean')
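
The z-score plotted above measures how many standard deviations each value lies from the mean. As a rough sketch on made-up numbers (the array `values` and the cut-off `threshold` are assumptions, not values used in the notebook), the same quantity can be computed with NumPy alone and used to flag suspicious points:

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # made-up amounts

# z-score: distance from the mean in units of standard deviation.
# values.std() uses ddof=0, matching the default of scipy.stats.zscore.
z = np.abs((values - values.mean()) / values.std())

threshold = 2.0  # hypothetical cut-off, chosen only for illustration
flagged = np.where(z > threshold)[0]
print(flagged)
```

In a small sample a single extreme value inflates the standard deviation it is measured against, so the z-score can understate how unusual that value is. This is one reason the notebook also uses quartile-based bounds.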

Defining the bounds to remove the outliers

In [None]:
# Quartiles of the sampled amounts (`method=` needs NumPy >= 1.22;
# older releases call this argument `interpolation=`)
Q1 = np.percentile(amount, 25, method='midpoint')
Q3 = np.percentile(amount, 75, method='midpoint')
IQR = Q3 - Q1
# Positions of the values outside the 1.5 * IQR fences
upper = np.where(amount >= (Q3 + 1.5 * IQR))
lower = np.where(amount <= (Q1 - 1.5 * IQR))
new = pd.DataFrame(amount)
new.drop(upper[0], inplace=True)
new.drop(lower[0], inplace=True)
print(new.shape)
new.head()
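
The bounds used here follow the common interquartile-range rule: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers. A small sketch of that rule on made-up numbers follows; the helper name `iqr_bounds` and the array `values` are hypothetical, and the sketch uses NumPy's default linear interpolation rather than the midpoint method of the cell above:

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 14.0])  # made up
lower, upper = iqr_bounds(values)

# Keep only the values strictly inside the fences.
kept = values[(values > lower) & (values < upper)]
print(lower, upper)
print(kept)
```

A boolean mask keeps the original order of the remaining values and avoids index-based `drop` calls; the multiplier `k` controls how aggressive the filter is.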

In [None]:
plt.plot(amount, label='Amount of purchase')
plt.plot(new, label='Clean amount', c='g', alpha=0.7)
plt.plot(m, linewidth=3, color='r', label='Median')
plt.legend()
plt.title("Comparing the amount with and without outliers")

In [None]:
plt.figure(figsize=(10, 10))
plt.subplot(2, 2, 1)
plt.boxplot(new, notch=False, sym='gD', showmeans=True,
            meanline=True, autorange=True)
plt.title('Cleaned amount without outliers')
plt.subplot(2, 2, 2)
plt.boxplot(amount, notch=False, sym='gD', showmeans=True,
            meanline=True, autorange=True)
plt.title('Original values of amount')

### Check the Correlation

I chose the Pearson method to compute the correlation coefficient matrix.

In [None]:
Sorted.corr("pearson")
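
For reference, the Pearson coefficient of two columns is the sum of the products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. The sketch below spells this out on made-up numbers (`pearson_r`, `items`, and `amount` are illustrative names and values, not notebook results):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

items = [1, 2, 3, 4, 5]                   # made-up item counts
amount = [12.0, 19.0, 33.0, 38.0, 52.0]   # made-up amounts

r = pearson_r(items, amount)
print(round(r, 4))
```

The result lies between -1 and 1; values close to either end indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.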

#### Visualize the correlation

As we only have five attributes, it is easy to check the correlation on a heatmap.

In [None]:
sns.heatmap(data.corr("pearson"), annot=True, cmap='GnBu')

## Joint variability

### Measure of the joint variability by using the Covariance

In [None]:
Sorted.cov()

In [None]:
sns.heatmap(Sorted.cov(), annot=True)
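
When reading the covariance matrix, keep in mind that covariance depends on the units of each column, whereas correlation does not. A minimal sketch on made-up numbers (`toy` and `scaled` are hypothetical frames) shows what happens when one column is rescaled:

```python
import pandas as pd

# Made-up values, used only to illustrate the effect of rescaling.
toy = pd.DataFrame({
    "items": [1, 2, 3, 4, 5],
    "amount": [12.0, 19.0, 33.0, 38.0, 52.0],
})
scaled = toy.assign(amount=toy["amount"] * 100)  # e.g. a change of units

cov_before = toy.cov().loc["items", "amount"]
cov_after = scaled.cov().loc["items", "amount"]
corr_before = toy.corr().loc["items", "amount"]
corr_after = scaled.corr().loc["items", "amount"]

print(cov_before, cov_after)    # covariance grows with the scale
print(corr_before, corr_after)  # correlation stays the same
```

For this reason the covariance heatmap is dominated by the columns with the largest numeric range, and the correlation heatmap is usually the easier one to compare across features.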

# Conclusion

I imported the libraries I needed for the exploratory data analysis (EDA) of the dataset. After loading the dataset and taking an overview of the data frame, I looked for missing values and did not find any.

To summarize the features, I grouped the dataset in different ways and studied the groups separately. The study covered sorting, histograms, scatter plots, box plots, correlation, covariance, and the size and shape of the data.

From these steps, I found duplicate values in one feature, `amount`, so I removed them and continued with the new dataset.

The plots, the box plot in particular, also revealed outliers, so I defined lower and upper bounds from the percentiles of `amount` and dropped the values outside them. After removing the outliers, I compared the shape, size, and behavior of the cleaned values with the original ones.

Regarding the questions and hypotheses:
- I found a relationship and correlation between different features of the dataset.
- There were duplicate values in the dataset.
- Customers in regions one and four spent more money on their purchases, and region two had the lowest purchase amount.
- Yes, there is a relationship between the number of items purchased and the amount spent.