# PVC Retail Data Analysis

In this notebook we analyze some key variables of the data. For each variable we compare the win/loss ratio and the distribution of its values, and finally we highlight the most notable results.

# 0. Program setup

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

filepath='/kaggle/input/retail-analytics/RETAIL_ANALYTICS.csv'

# 1. Data cleaning

First, let's show the raw data.

In [None]:
RawData=pd.read_csv(filepath)
RawData

Now, let's discard the variables we won't work with. Also, let's convert the variable "Month" into "Season" to see whether any patterns are related to each season.

In [None]:
Data=RawData.drop(columns=['Enquiry Date',
                           'Enquiry Id',
                           'State',
                           'Pincode',
                           'First Action-Call made',
                           'Date DD/MM/YY', 
                           'First Action-Call Status',
                           'Date of Appointment (DD/MM/YY)', 
                           'Second Action-Customer Meeting',
                           'Date DD/MM/YY.1', 
                           'Second Action-Call Status',
                           'Third Action-Quote Given', 
                           'Date DD/MM/YY.2', 
                           'Q Val. (Rs. Lac)',
                           'Quote QTY', 
                           'Date DD/MM/YY.3',
                           'Order Val. (Rs. Lac)', 
                           'Order QTY',
                           'Quote ID (as per match to CCC Records)',
                           ' Remarks-Brand and value if lost to UPVC ', 
                           'Second Action-Call Status.1',
                           'Benefits']
                 )
NumWindowsCat=pd.api.types.CategoricalDtype(categories=['1 to 5',
                                             '6 to 10',
                                             '11 to 20',
                                             '21 to 40',
                                             '41 to 100',
                                             '100 +'],
                                  ordered=True)
Data['No of Windows']=Data['No of Windows'].astype(NumWindowsCat)
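# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
Data['Month']=Data['Month'].str.replace(r"'1[67]",'',regex=True)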

def MonthSeason(month):
    # Map a month label to its season; unknown labels map to ''
    Season=''
    if month in ('April','June'): Season='Spring'
    elif month in ('July', 'Aug', 'Sept'): Season='Summer'
    elif month in ('Oct', 'Nov', 'Dec'): Season='Autumn'
    elif month in ('Jan',): Season='Winter'  # trailing comma makes this a tuple, not a substring test
    return Season

SeasonCat=pd.api.types.CategoricalDtype(categories=['Winter','Spring','Summer','Autumn'], ordered=True)
    
Data['Season']=Data['Month'].apply(lambda month: MonthSeason(month))
Data['Season']=Data['Season'].astype(SeasonCat)

Data2=Data.drop(columns=['Month'])

Data2
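As a side note, the month-to-season conversion above can also be written as a dictionary lookup. The following is only a minimal sketch in plain Python: `SEASON_BY_MONTH` and `month_to_season` are hypothetical names, and the label format (`"Aug'16"`) is an assumption inferred from the regular expression used in the cleaning cell, not something confirmed by the dataset.

```python
import re

# Hypothetical lookup table; the month spellings mirror those used in MonthSeason.
SEASON_BY_MONTH = {
    'April': 'Spring', 'June': 'Spring',
    'July': 'Summer', 'Aug': 'Summer', 'Sept': 'Summer',
    'Oct': 'Autumn', 'Nov': 'Autumn', 'Dec': 'Autumn',
    'Jan': 'Winter',
}

def month_to_season(label):
    """Strip a year suffix such as '16 or '17, then look up the season."""
    month = re.sub(r"'1[67]", '', label).strip()
    # Unknown labels map to an empty string, matching MonthSeason's default.
    return SEASON_BY_MONTH.get(month, '')

print(month_to_season("Aug'16"))  # Summer
print(month_to_season("Jan'17"))  # Winter
```

With a dictionary, a missing or misspelled month shows up as a missing key rather than as a forgotten branch, and the same table could be passed to `Series.map` on the "Month" column.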

# 2. Data analysis

### Won vs Lost orders
Now that the data is ready, let's start with the categorical variables. For each one we plot the win/loss ratio for every value and the share of each value among wins and among losses. We then plot the distribution of the price per square foot for wins and losses.

In [None]:
def move_legend(ax, new_loc, **kws):
    old_legend = ax.legend_
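    # Matplotlib 3.7 renamed `legendHandles` to `legend_handles`; support both names
    handles = old_legend.legend_handles if hasattr(old_legend, 'legend_handles') else old_legend.legendHandles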
    labels = [t.get_text() for t in old_legend.get_texts()]
    title = old_legend.get_title().get_text()
    ax.legend(handles, labels, loc=new_loc, title=title, **kws)
    
for col1 in ('City', 'Type of Project', 'No of Windows', 'Source', 'Zone', 'Dealer Name','Season'):
    fig, ax = plt.subplots(ncols=2,figsize=(12,6))
    fig.suptitle('{}: Win/Loss ratio for each value and value distribution on wins and losses'.format(col1))
    sns.histplot(Data2, x=col1, hue='Status', multiple='stack',ax=ax[0])
    sns.histplot(Data2, x='Status', hue=col1, multiple='fill',ax=ax[1])
    for tick in ax[0].get_xticklabels():
        tick.set_rotation(45)
    move_legend(ax[1],'center right',bbox_to_anchor=(1.45, 0.5))

fig, ax = plt.subplots(ncols=2,figsize=(12,6))
fig.suptitle('Price Per Sft Win/Loss Distribution')
sns.stripplot(data=Data2, y='Price Per Sft',x='Status',ax=ax[0])
sns.histplot(data=Data2, x='Price Per Sft',hue='Status',ax=ax[1], multiple='stack',binwidth=100)
plt.show()

### Review:
- Chennai and Kanchipuram are the cities with the highest win rates.
- The type of project doesn't influence the order outcome.
- Orders with a greater number of windows have a better probability of success, but they are rare.
- Orders that came through the internet have a higher probability of success than those from any other source, closely followed in number of wins by the friends or family source.
- The Direct dealer has by far the highest rate of won orders. Windoors, despite its high number of orders, has a poor success rate.
- Summer is the season with the most orders, but autumn is the one with the highest success rate.
- All the lost orders had a price per square foot between 850 and 1250 $.

Now let's focus on the price per square foot win/loss distribution and how the other variables affect it.
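The rates read off the plots above can also be checked numerically. The sketch below uses made-up toy data rather than the real dataset; `win_rate_table` is a hypothetical helper, and the `'Won'` label for successful orders is an assumption (only `'Lost'` appears explicitly in this notebook).

```python
import pandas as pd

# Toy data standing in for Data2; values are invented for illustration only.
toy = pd.DataFrame({
    'Source': ['Internet', 'Internet', 'Friends', 'Friends', 'Friends', 'Walk-in'],
    'Status': ['Won', 'Lost', 'Won', 'Lost', 'Lost', 'Lost'],
})

def win_rate_table(df, col, status_col='Status', won_label='Won'):
    """Return the order count and the fraction of won orders for each value of `col`."""
    is_won = df[status_col].eq(won_label)  # boolean Series: True for won orders
    table = is_won.groupby(df[col]).agg(orders='count', win_rate='mean')
    return table.sort_values('win_rate', ascending=False)

print(win_rate_table(toy, 'Source'))
```

Showing the order count next to the win rate helps separate a high rate backed by many orders from one that comes from only a handful of cases.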

In [None]:
for col1 in ('City', 'Type of Project', 'No of Windows', 'Source', 'Zone', 'Dealer Name','Season'):
    sns.catplot(data=Data2, x=col1, y='Price Per Sft', hue='Status')
    plt.xticks(rotation=45)
    plt.show()


### Review:
- The same patterns seen in the previous analysis appear here. No order with a price per square foot outside the 850 to 1250 $ range has been lost.
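A claim about a price range is easy to verify directly from the table instead of from a plot. This is a minimal sketch on invented toy values (not the dataset's); `price_range_by_status` is a hypothetical helper and the `'Won'` label is an assumption.

```python
import pandas as pd

# Toy data; the prices are made up and do not come from the dataset.
toy = pd.DataFrame({
    'Status': ['Won', 'Won', 'Lost', 'Lost', 'Won'],
    'Price Per Sft': [700, 1400, 900, 1200, 1000],
})

def price_range_by_status(df, status_col='Status', price_col='Price Per Sft'):
    """Return the minimum and maximum price for each order status."""
    return df.groupby(status_col)[price_col].agg(['min', 'max'])

ranges = price_range_by_status(toy)
print(ranges)
```

Applied to `Data2`, the row for lost orders would give the exact bounds of the range read off the plots.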

### Lost orders

Now let's focus on the reasons for order losses.

In [None]:
DataL=Data2[Data2['Status']=='Lost'].drop(columns=['Status','Aesthetics', 'Reduce Street Noise',
       'Low Maintenance', 'Monsoon Proof', 'Better Lighting',
       'Reduce AC Energy Cost'])
# Raw strings avoid the invalid '\L' escape sequence in the column name
DataL[r' Order Recd\Lost ']=DataL[r' Order Recd\Lost '].replace({'Delivery Time Not Possible':'Delivery Time',
                                   'Lost to UPVC (Provide details)':'Lost to UPVC',
                                   'Product Issue-Design/Type':'Product'})
DataL['Remarks']=DataL['Remarks'].replace(regex={r'Price [Ii]ssue':'Price',
                                                 'No Requirment':'No Requirement',
                                                 ' Infeasibility':'',
                                                 ' Not Feasible':'',
                                                 ' Infeasibe My':'',
                                                 '-Awareness Call':'',
                                                 ' Constraint':'',
                                                 'Delivery Iss[uv]e':'Delivery'})

The plots below show, for each loss reason, the distribution of the detailed remarks, first as counts and then as fractions.

In [None]:
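# Absolute counts of remarks per loss reason
sns.displot(data=DataL, x=r' Order Recd\Lost ', hue='Remarks',multiple='stack')
plt.xticks(rotation=45)
plt.show()
# Same data normalized so each bar shows the share of each remark
sns.displot(data=DataL, x=r' Order Recd\Lost ', hue='Remarks',multiple='fill')
plt.xticks(rotation=45)
plt.show()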

### Review:
- Customers' preference for UPVC is the main reason for order losses.
- Orders lost to other materials are lost mainly because of the price and the customer's budget.

Now let's see whether the number of windows has any influence on the losses due to economic reasons.

In [None]:
def iseconomical(issue):
    if issue in ('Price','Budget'): return 'Yes'
    else: return 'No'
DataL['Economical Issue?']=DataL['Remarks'].apply(lambda issue: iseconomical(issue))
plot=sns.catplot(data=DataL, x=r' Order Recd\Lost ', y='Price Per Sft',hue='Economical Issue?', col='No of Windows', col_wrap=3)
plot.set_xticklabels(rotation=45)
plt.show()

### Review:
- When the order consists of more than 5 windows, the loss reason is almost always economic, which is expected.
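The same comparison can be summarized as a table of shares per window bucket. The following is a rough sketch on invented toy rows, assuming the "Remarks" values have already been normalized to labels such as `'Price'` and `'Budget'` as in the cleaning cell above.

```python
import pandas as pd

# Toy lost orders; the rows are invented for illustration only.
toy = pd.DataFrame({
    'No of Windows': ['1 to 5', '1 to 5', '6 to 10', '6 to 10', '11 to 20'],
    'Remarks': ['Product', 'Price', 'Price', 'Budget', 'Budget'],
})

# Flag price- or budget-related remarks, mirroring the 'Economical Issue?' column.
toy['Economical Issue?'] = toy['Remarks'].isin(['Price', 'Budget']).map({True: 'Yes', False: 'No'})

# Each row sums to 1: the share of economic vs. other reasons within a bucket.
share = pd.crosstab(toy['No of Windows'], toy['Economical Issue?'], normalize='index')
print(share)
```

Here the buckets are plain strings, so the rows are sorted alphabetically; with the ordered categorical dtype used in the notebook they would follow the bucket order instead.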

Finally, let's analyze which benefits customers preferred most in the successful orders.

In [None]:
BenefitCols=['Aesthetics', 'Reduce Street Noise', 'Low Maintenance', 'Monsoon Proof', 'Better Lighting', 'Reduce AC Energy Cost']
# Count how many rows flag each benefit with a 1
counts=(Data2[BenefitCols]==1).sum()
# Repeat each benefit name once per flagged row so displot draws one bar per benefit
DataWHist=pd.DataFrame({'Benefits': np.repeat(BenefitCols, counts.to_numpy())})
sns.displot(data=DataWHist, x='Benefits')
plt.xticks(rotation=45)
plt.show()

### Review:
- Street noise reduction and low maintenance are the customers' favorites, whereas better lighting and monsoon proofing are the least demanded.

# 3. Conclusions

- Autumn is the season with the highest success rate, while summer, despite having the most orders, has the worst.
- The internet is the source with the highest success rate and also the most common order source. It is closely followed by family and friends, which brought slightly fewer orders and had a slightly lower success rate.
- The Direct dealer has by far the highest order success rate, followed by Sunbird, which has far fewer orders. Windoors, despite ranking second in number of orders, has a much lower success rate than Direct.
- Every single lost order had a price per square foot between 850 and 1250 $.
- When PVC loses to aluminium, wood and especially UPVC, it is mainly for economic reasons.
- When the order consists of more than 5 windows, the loss reason is almost always economic, which is expected.
- Street noise reduction and low maintenance are the customers' favorite benefits, whereas better lighting and monsoon proofing are the least demanded.

## Author comments

This is my first attempt at analyzing business data. I would be grateful for any feedback that helps me improve my methods (and my scripting).