### Introduction
The data for this analysis comes from Kaggle. It is used to compare the two leading shoe brands, Nike and Adidas. The tasks carried out are an exploratory analysis, clustering of the products, and a competitive analysis between the brands.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df= pd.read_csv("../input/adidas-vs-nike/Adidas Vs Nike.csv")
df.head()

In [None]:
df.shape

In [None]:
df.dtypes

After checking the data types, I noticed that the listing price and sale price are not
expressed as currency amounts. To get the correct amounts, I divided the Listing Price and
Sale Price columns by 100.

In [None]:
df[['Listing Price', 'Sale Price']] = df[['Listing Price', 'Sale Price']] / 100

In [None]:
df.head()

In [None]:
missing_value = df.isnull()
missing_value.head()

In [None]:
for column in missing_value.columns.values.tolist():
    print(column)
    print (missing_value[column].value_counts())
    print("")    

The code above checks for missing values. The only gaps are in the Description column, where three rows have no text. This is not a major concern for the analysis, so the missing strings are ignored. The data contains 3268 rows and 10 columns.
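
A more compact way to get the same per-column picture is to sum the boolean mask. The sketch below uses a small made-up frame (the `toy` data is an assumption for illustration, not the Kaggle file):

```python
import pandas as pd

# Hypothetical stand-in for the real data: one missing Description.
toy = pd.DataFrame({
    "Brand": ["Nike", "Adidas ORIGINALS", "Nike"],
    "Description": ["Running shoe", None, "Trainer"],
    "Sale Price": [74.95, 59.99, 129.95],
})

# isnull() gives a True/False mask; summing it counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Any column with a non-zero count has missing entries, so there is no need to read through a block of True/False counts for every column.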

The columns that are not needed for the analysis are dropped.

In [None]:
df.drop(['Product Name', 'Product ID', 'Description', 'Last Visited'], axis=1, inplace=True)
df.head()

### Exploratory Analysis

To better understand the data set, I explored it with visualization and statistical analysis, importing the libraries needed. I used the pandas method .describe() to see the basic statistical breakdown of the data. Bar plots and histograms show the counts (frequency of occurrence) of the brands. To do this, I grouped by brand to view the sale price statistics and the ratings for each brand.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()

In [None]:
df.describe()

In [None]:
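# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
listing_Price = df['Listing Price']
sns.histplot(listing_Price, bins=100, kde=True, stat='density', color='C0', label='Listing Price')
price = df['Sale Price']
sns.histplot(price, bins=100, kde=True, stat='density', color='C1', label='Sale Price')
plt.title('Price Distribution')
plt.legend()
plt.xlabel('Price')
plt.show()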

From the statistics of the data, the mean sale price is 61.34 and the median is 47.99, with a minimum of 4.49 and a maximum of 365.0. The average discount is 26.87, and ratings are on a 0 to 5 scale, where zero means a bad rating or no rating and five is the best rating. Both the Sale Price and Listing Price distributions are right-skewed, with the sale price closer to a normal distribution than the listing price.
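
A quick numeric sign of right skew is a mean that sits above the median, as it does here (61.34 against 47.99). The sketch below shows the effect on a made-up list of prices using only the standard library; the values are assumptions chosen to have a long right tail.

```python
import statistics

# Made-up prices with a long right tail.
prices = [20, 25, 30, 35, 40, 45, 200]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# In a right-skewed sample the few large values pull the mean above the median.
print(mean_price, median_price, mean_price > median_price)
```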

To further understand the data, I counted the products under each unique brand. One brand appeared only once; looking closely, I noticed that the word 'adidas' was repeated in its name. To get meaningful data, I removed the repeated word and ran the value count again, which updates Adidas ORIGINALS from 907 to 908, as seen below.

In [None]:
df['Brand'].value_counts()

In [None]:
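# regex=True is required in pandas 2.0 and later, where str.replace treats the pattern as literal text by default
df['Brand'] = df['Brand'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)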

Code Note: \b        # word boundary
          (\w+)     # 1st capture group of a single word
          ( 
          \s+       # 1 or more spaces
          \1        # reference to first group 
          )+        # one or more repeats
          \b
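
To see what the pattern does in isolation, here is a small sketch using Python's standard `re` module. The sample strings are made up for illustration and may not match the exact spelling in the data.

```python
import re

# Same pattern as above: a word followed by one or more repeats of itself.
pattern = re.compile(r'\b(\w+)(\s+\1)+\b')

samples = ["Adidas Adidas ORIGINALS", "Adidas CORE / NEO", "Nike"]

# Replace each run of repeated words with a single copy (group 1).
cleaned = [pattern.sub(r'\1', s) for s in samples]
print(cleaned)
```

Only the first string contains a repeated word, so the other two come back unchanged.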

In [None]:
Brand_count = df['Brand'].value_counts()
Brand_count

In [None]:
Brand_count.plot(kind='barh', title= 'Brand Counts', color='skyblue')

The value counts above show that Adidas CORE / NEO has the highest count at 1111. Nike has 643, while Adidas SPORT PERFORMANCE has the lowest at 606.

To compare against these counts, I computed summary statistics of the sale price for each brand.

In [None]:
Sale_Price_Stat = df.groupby('Brand')['Sale Price'].describe()
Sale_Price_Stat

Nike has the highest average sale price, 102.13, as seen above, while Adidas CORE / NEO has the lowest, 31.16.

In [None]:
Sale_Price_Avg = Sale_Price_Stat.plot(kind='barh', y='mean', title= 'Average Sale Price', color='skyblue')

Setting the average sale price against the brand counts makes the picture clearer: Adidas CORE / NEO has the highest count, which may be because of its lower price.

The average rating for each brand was then estimated.

In [None]:
ratings_desc = df.groupby('Brand')['Rating'].describe()
ratings_desc

In [None]:
ratings_desc.plot.barh(y='mean', color='skyblue', title= 'Average Rating', label= 'Avg Rating')

The results show that Adidas CORE / NEO has the highest average rating, which suggests that customers who bought this brand rated it more favourably.

To understand the ratings better, I counted how often each rating value occurs, from the minimum to the maximum, and drew a bar plot of the counts.

In [None]:
Rating_Dist = df.Rating.value_counts()

plt.figure(figsize=(10, 4))

sns.barplot(x=Rating_Dist.index, y=Rating_Dist.values)

plt.xlabel('Rating')

plt.ylabel('Counts')

plt.show()

About 300 customers gave a zero rating, which I treat as no rating, while about 230 gave a rating of 5, as seen in the plot above.
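
Since a rating of zero is treated here as no rating, it may be worth averaging only the non-zero ratings. This is a minimal sketch on made-up values (the `toy` frame is an assumption), not a step taken in the analysis above:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: one Nike product has no rating (0).
toy = pd.DataFrame({
    "Brand": ["Nike", "Nike", "Adidas CORE / NEO", "Adidas CORE / NEO"],
    "Rating": [0.0, 4.0, 3.0, 5.0],
})

# Treat 0 as "no rating" by turning it into NaN, which mean() skips.
rated = toy.assign(Rating=toy["Rating"].replace(0, np.nan))

avg_all = toy.groupby("Brand")["Rating"].mean()
avg_rated = rated.groupby("Brand")["Rating"].mean()
print(avg_all)
print(avg_rated)
```

In this toy example the Nike average moves from 2.0 to 4.0 once the zero is excluded, which shows how much unrated products can pull a brand average down.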

### Competitive Analysis

To perform the competitive analysis, I created a DataFrame for each brand, computed summary statistics for each one, and compared their sale prices.

In [None]:
# Chaining reset_index avoids modifying a filtered slice in place
N_Brand = df[df['Brand'].str.contains('Nike')].reset_index(drop=True)
N_Brand.head()

In [None]:
N_Brand.describe()

DataFrames for each Adidas brand:

In [None]:
# One DataFrame per Adidas brand, each with a fresh 0-based index
A1_Brand = df[df['Brand'].str.contains('Adidas ORIGINALS')].reset_index(drop=True)
A2_Brand = df[df['Brand'].str.contains('Adidas CORE / NEO')].reset_index(drop=True)
A3_Brand = df[df['Brand'].str.contains('Adidas SPORT PERFORMANCE')].reset_index(drop=True)

After saving each brand in a separate DataFrame, I grouped the Sale Price of each brand into ranges and counted the number of prices in each range.

SPN represents the Sale Price for Nike

In [None]:
# Bin the sale prices into 50-unit ranges. pd.cut uses (lower, upper] intervals,
# so prices such as 50.49 are not lost between a '<= 50' filter and a '>= 51' filter.
price_bins = [0, 50, 100, 150, 200, 250, 300, 350, 400]
SPNx = ['0 - 50', '51 - 100', '101 - 150', '151 - 200', '201 - 250', '251 - 300', '301 - 350', '351 - 400']
SPN_binned = pd.cut(N_Brand['Sale Price'], bins=price_bins, labels=SPNx, include_lowest=True)

# Count the products in each range, keeping the ranges in their natural order
SPNy = SPN_binned.value_counts(sort=False).tolist()


plt.figure(figsize=(18, 8))

sns.barplot(x=SPNx, y=SPNy, color='b', alpha=0.8)

plt.title('Nike Sale Price', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Sale Price', fontsize = 15)
plt.ylabel('No of Sold', fontsize = 15)
plt.show()

SPA1 represents the Sale Price for Adidas ORIGINALS

In [None]:
# Bin the sale prices into 50-unit ranges. pd.cut uses (lower, upper] intervals,
# so prices such as 50.49 are not lost between a '<= 50' filter and a '>= 51' filter.
price_bins = [0, 50, 100, 150, 200, 250, 300, 350, 400]
SPA1x = ['0 - 50', '51 - 100', '101 - 150', '151 - 200', '201 - 250', '251 - 300', '301 - 350', '351 - 400']
SPA1_binned = pd.cut(A1_Brand['Sale Price'], bins=price_bins, labels=SPA1x, include_lowest=True)

# Count the products in each range, keeping the ranges in their natural order
SPA1y = SPA1_binned.value_counts(sort=False).tolist()

plt.figure(figsize=(18, 8))


sns.barplot(x=SPA1x, y=SPA1y, color='r', alpha=0.8)

plt.title('Adidas ORIGINALS', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Sale Price', fontsize = 15)
plt.ylabel('No of Sold', fontsize = 15)
plt.show()

SPA2 represents the Sale Price for Adidas CORE/NEO

In [None]:
# Bin the sale prices into 50-unit ranges. pd.cut uses (lower, upper] intervals,
# so prices such as 50.49 are not lost between a '<= 50' filter and a '>= 51' filter.
price_bins = [0, 50, 100, 150, 200, 250, 300, 350, 400]
SPA2x = ['0 - 50', '51 - 100', '101 - 150', '151 - 200', '201 - 250', '251 - 300', '301 - 350', '351 - 400']
SPA2_binned = pd.cut(A2_Brand['Sale Price'], bins=price_bins, labels=SPA2x, include_lowest=True)

# Count the products in each range, keeping the ranges in their natural order
SPA2y = SPA2_binned.value_counts(sort=False).tolist()
plt.figure(figsize=(18, 8))


sns.barplot(x=SPA2x, y=SPA2y, color='g', alpha=0.8)

plt.title('Adidas CORE / NEO', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Sale Price', fontsize = 15)
plt.ylabel('No of Sold', fontsize = 15)
plt.show()

SPA3 represents the Sale Price for Adidas SPORT PERFORMANCE

In [None]:
# Bin the sale prices into 50-unit ranges. pd.cut uses (lower, upper] intervals,
# so prices such as 50.49 are not lost between a '<= 50' filter and a '>= 51' filter.
price_bins = [0, 50, 100, 150, 200, 250, 300, 350, 400]
SPA3x = ['0 - 50', '51 - 100', '101 - 150', '151 - 200', '201 - 250', '251 - 300', '301 - 350', '351 - 400']
SPA3_binned = pd.cut(A3_Brand['Sale Price'], bins=price_bins, labels=SPA3x, include_lowest=True)

# Count the products in each range, keeping the ranges in their natural order
SPA3y = SPA3_binned.value_counts(sort=False).tolist()

plt.figure(figsize=(18, 8))


sns.barplot(x=SPA3x, y=SPA3y, color='y', alpha=0.8)

plt.title('Adidas SPORT PERFORMANCE', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Sale Price', fontsize = 15)
plt.ylabel('No of Sold', fontsize = 15)
plt.show()
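
The four cells above repeat the same binning for each brand. As a sketch of a shorter route, `pd.cut` can assign every price to a range and `pd.crosstab` can count the ranges per brand in one step. The `toy` frame, its prices, and the `Price Range` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical prices; 50.49 sits between 50 and 51 on purpose.
toy = pd.DataFrame({
    "Brand": ["Nike", "Nike", "Adidas ORIGINALS", "Adidas ORIGINALS", "Adidas CORE / NEO"],
    "Sale Price": [50.49, 129.95, 45.00, 99.99, 24.99],
})

# Bin edges every 50 units; each bin is (lower, upper], so no price falls in a gap.
edges = list(range(0, 401, 50))
labels = [f"{lo} - {hi}" for lo, hi in zip(edges[:-1], edges[1:])]

toy["Price Range"] = pd.cut(toy["Sale Price"], bins=edges, labels=labels, include_lowest=True)

# Rows are brands, columns are price ranges, cells are product counts.
counts = pd.crosstab(toy["Brand"], toy["Price Range"])
print(counts)
```

A table like this could then be plotted with `counts.T.plot(kind='bar')` to put all brands on one chart, which makes the price ranges easier to compare than four separate figures.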

### KMeans Clustering 

The clustering is based on the Sale Price, Listing Price and Rating. The libraries needed are loaded first.

In [None]:
from sklearn import preprocessing 
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score

from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
df.columns

To start the clustering, I only need the numerical columns that will be compared against each other. Therefore, I dropped the Brand and Reviews columns so that the remaining data can be normalized.

In [None]:
df.drop(['Brand', 'Reviews'], axis=1, inplace=True)
df.head()

The next step was to normalize the data. I used scikit-learn's MinMaxScaler, which rescales each column to the range 0 to 1.

In [None]:
df_norm = df.values

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
df_norm_scaled = min_max_scaler.fit_transform(df_norm)
df_new = pd.DataFrame(df_norm_scaled)
df_new.head()

In [None]:
df_new.columns = ['Listing Price', 'Sale Price', 'Discount','Rating']
df_new.head()
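
For reference, min-max scaling maps each column to the range 0 to 1 with (x - min) / (max - min), while standard scaling subtracts the mean and divides by the standard deviation. The sketch below works through both on a made-up list of prices using only the standard library. The helper names `min_max_scale` and `standardize` are hypothetical; they mirror what `MinMaxScaler` and `StandardScaler` compute for a single column, and are not the library implementation.

```python
import statistics

prices = [20.0, 40.0, 60.0, 100.0]  # made-up values


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Centre values on 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


scaled = min_max_scale(prices)
z_scores = standardize(prices)
print(scaled)
print(z_scores)
```

Both transforms keep the ordering of the values; they differ in the range of the output, which matters for distance-based methods such as KMeans.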

### Elbow Curve Method 

The elbow curve was used to determine an appropriate number of clusters. I varied k from 1 to 12, calculated the within-cluster sum of squares (WCSS) for each value, and plotted WCSS against k.

In [None]:
wcss = []
for k in range(1, 13):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    # iloc[:, 1:] leaves out the first column (Listing Price)
    kmeans.fit(df_new.iloc[:, 1:])
    # inertia_ is the within-cluster sum of squares for this k
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(12, 6))
plt.grid()
plt.plot(range(1, 13), wcss, linewidth=2, color="blue", marker="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1, 13, 1))
plt.ylabel("WCSS")
plt.show()

The curve above bends at a k value of 3, so three clusters are used.
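
WCSS is the sum, over all points, of the squared distance from each point to the centre of its cluster; this is the quantity scikit-learn reports as `inertia_`. Here is a small pure-Python sketch on made-up 2-D points, with the cluster labels fixed by hand instead of fitted, so it illustrates the measure and not the KMeans algorithm itself. The function name `wcss_for` is hypothetical.

```python
from collections import defaultdict


def wcss_for(points, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid: mean of each coordinate across the cluster's points.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        for point in members:
            total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = wcss_for(points, [0, 0, 0, 0])
two_clusters = wcss_for(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

Splitting the points into their two natural groups cuts the WCSS sharply. Adding further clusters would only shave off a little more, and that flattening is the bend the elbow curve looks for.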

In [None]:
X = StandardScaler().fit_transform(df_new)
X

In [None]:
num_clusters = 3

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(X)
labels = k_means.labels_

print(labels)

In [None]:
df_new["Labels"] = labels
df_new.head()

In [None]:
df_new.groupby('Labels').mean()

In [None]:
plt.figure(figsize=(20,10))
plt.scatter(X[labels==0, 0], X[labels==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[labels==1, 0], X[labels==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[labels==2, 0], X[labels==2, 1], s=100, c='green', label ='Cluster 3')

plt.legend(fontsize=20)


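# Columns 0 and 1 of X are the standardized Listing Price and Sale Price
plt.title('Clusters by Listing Price and Sale Price', fontsize=20)
plt.xlabel('Listing Price (standardized)', fontsize=20)
plt.ylabel('Sale Price (standardized)', fontsize=20)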
plt.show()

### Conclusion

From the three predicted clusters, it can be inferred that clusters 1 and 2 are driven mostly by the Listing Price and Sale Price. This cluster pattern can be used to understand how the products are priced.

Thank you

By: Wasiu Ogunbowale