# SuperStore Dataset Analysis
# Re: Python Data Analysis Starting from Zero (Hands-On Edition)

> Note: this article was originally written in both English and Chinese, and the two versions do not correspond exactly.

<font size=3> In this notebook, we'll use Python to analyze a supermarket's sales data from 2011 to 2015 and see whether we can dig some useful information out of this large dataset.</font>

<font size=3> Along the way, we'll analyze overall sales and profit, tabulate sales by region, find the top ten products by sales volume, sales revenue, and profit, and chart the share of each customer type. </font>

<font size=4> Ready? Let's get started. </font>

## (1) Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
plt.rcParams['font.sans-serif'] = ['SimHei']
warnings.filterwarnings('ignore')      

<font size=3> In addition to some commonly used libraries, we also import the warnings library to silence messy, annoying messages (it is otherwise of little use here). </font>
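
Silencing every warning for the whole notebook can hide useful messages. As a hedged alternative, the sketch below limits suppression to a single block with `warnings.catch_warnings()`. The function `noisy_step` is a hypothetical stand-in for any call that emits a warning.

```python
import warnings


def noisy_step():
    # Hypothetical helper that emits a warning, standing in for a library call.
    warnings.warn("this call is deprecated", DeprecationWarning)
    return 42


# Warnings are ignored only inside this block; the previous filters are
# restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

print(result)
```

Outside the `with` block, warnings behave as they did before, so unexpected messages from later cells still show up.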

## (2) Data Preprocessing

<font size=3> When loading the dataset, we specify the 'ISO-8859-1' encoding so that pandas can read the file correctly.
    
At the same time, we rename the columns that do not follow Python naming conventions, using underscores consistently.</font>

In [None]:
df = pd.read_csv('../input/superstore-data/superstore_dataset2011-2015.csv', encoding='ISO-8859-1')

df.rename(columns=lambda x: x.replace(' ', '_').replace('-', '_'), inplace=True)
df.head()
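
To make the renaming step concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and only mimic the style of the raw CSV header.

```python
import pandas as pd

# Toy frame with made-up columns in the style of the raw header.
raw = pd.DataFrame(
    {"Order Date": ["2011-01-01"], "Sub-Category": ["Paper"], "Sales": [10.5]}
)

# Replace spaces and hyphens with underscores. Without inplace=True,
# rename returns a new frame and leaves `raw` untouched.
clean = raw.rename(columns=lambda name: name.replace(" ", "_").replace("-", "_"))

print(list(clean.columns))
```

Returning a new frame instead of renaming in place makes it easy to compare the header before and after.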

<font size=3> Check the data type of each column. </font>

In [None]:
df.dtypes

<font size=3> As we can see above, most columns are of type object, while 'Sales', 'Profit', and the other numeric columns are already numeric and need no processing.

However, the order date should be of type datetime, so it needs to be converted.

To make grouped lookups easier, we also add 'Year' and 'Month' columns to the data. </font>

In [None]:
df["Order_Date"] = pd.to_datetime(df["Order_Date"])
df["Ship_Date"] = pd.to_datetime(df["Ship_Date"])
df['Year'] = df["Order_Date"].dt.year
# Truncate each order date to the first day of its month
df['Month'] = df['Order_Date'].values.astype('datetime64[M]')
df.head()
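
The sketch below shows the same date handling on a toy frame with made-up dates. It derives the month-start value through the pandas `.dt` accessor rather than a NumPy cast, and adds a plain month number; the `Month_Number` column is an assumption for illustration and is not part of the notebook.

```python
import pandas as pd

# Made-up order dates in an unambiguous ISO format.
orders = pd.DataFrame({"Order_Date": ["2011-01-15", "2011-02-03", "2012-02-20"]})
orders["Order_Date"] = pd.to_datetime(orders["Order_Date"], format="%Y-%m-%d")

orders["Year"] = orders["Order_Date"].dt.year
# Month as a timestamp pinned to the first day of the month.
orders["Month"] = orders["Order_Date"].dt.to_period("M").dt.to_timestamp()
# Month as an integer 1..12 (hypothetical extra column).
orders["Month_Number"] = orders["Order_Date"].dt.month

print(orders)
```

Passing an explicit `format` avoids silent day/month mix-ups when the source dates are ambiguous.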

<font size=3> Check for rows or columns with many missing values. </font>

In [None]:
df.isnull().sum()

<font size=3> We can see that the column with many missing values is 'Postal_Code'. It holds postcode information, which matters little for our analysis, so we can drop it. </font>

In [None]:
df.drop(["Postal_Code"], axis=1, inplace=True)
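
As a hedged variation, the decision to drop a column can be driven by its share of missing values instead of a hard-coded name. The toy data and the 50% threshold below are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: three of four postal codes are missing.
toy = pd.DataFrame(
    {
        "Postal_Code": [np.nan, np.nan, 10024.0, np.nan],
        "Sales": [12.0, 7.5, 3.2, 9.9],
    }
)

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

threshold = 0.5  # assumed cut-off, not taken from the notebook
sparse_columns = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_columns)
print(sparse_columns, list(trimmed.columns))
```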

<font size=3> Check for duplicate rows and outliers. </font>

In [None]:
df.describe()

In [None]:
df.duplicated().sum()

<font size=3> We find no duplicates and no obvious outliers here, so the data needs no further processing. </font>
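
Reading `describe()` by eye is one way to look for outliers. A common rule of thumb, sketched here on made-up numbers, flags values more than 1.5 interquartile ranges outside the quartiles. This rule is an assumption on my part and is not the check the notebook performs; on heavily skewed sales data it would likely flag many legitimate large orders.

```python
import pandas as pd

# Made-up sales values with one repeated row and one extreme value.
toy = pd.DataFrame({"Sales": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})

# Rows that are exact copies of an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# Interquartile-range fences.
q1, q3 = toy["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = toy[(toy["Sales"] < lower) | (toy["Sales"] > upper)]
print(duplicate_rows, outliers["Sales"].tolist())
```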

## (3) Data analysis

### Select and print the 'Order_Date', 'Sales', 'Profit', 'Year', and 'Month' columns

In [None]:
df1=df[['Order_Date','Sales','Profit','Year','Month']]
df1

### Group the data by 'Year' and 'Month' and sum

In [None]:
df1.groupby('Year')['Month'].value_counts()

In [None]:
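# Sum only the numeric columns; recent pandas versions refuse to sum the datetime column
sales=df1.groupby(['Year','Month'])[['Sales','Profit']].sum()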
sales
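
Here is a small, self-contained sketch of the grouping step on made-up data. It selects the numeric columns before summing, and it uses an integer month number rather than the month-start timestamp used in the notebook; that choice is an assumption made to keep the toy output short.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Order_Date": pd.to_datetime(
            ["2011-01-05", "2011-01-20", "2011-02-11", "2012-01-09"]
        ),
        "Sales": [100.0, 50.0, 80.0, 120.0],
        "Profit": [20.0, -5.0, 10.0, 30.0],
    }
)
toy["Year"] = toy["Order_Date"].dt.year
toy["Month"] = toy["Order_Date"].dt.month

# One row per (Year, Month) pair, with summed Sales and Profit.
monthly = toy.groupby(["Year", "Month"])[["Sales", "Profit"]].sum()
print(monthly)
```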

### Split the data above into one table per year

<font size=3> Here we print the 2011 table to show the result. </font>

In [None]:
year_2011 = sales.loc[(2011,slice(None)),:].reset_index()
year_2012 = sales.loc[(2012,slice(None)),:].reset_index()
year_2013 = sales.loc[(2013,slice(None)),:].reset_index()
year_2014 = sales.loc[(2014,slice(None)),:].reset_index()
year_2011
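
Writing one line per year invites copy-and-paste slips. As a hedged sketch on toy data, the per-year tables can be built in a loop with `DataFrame.xs`. Note that `xs` drops the 'Year' level from the result, unlike the `.loc` slicing used in the notebook.

```python
import pandas as pd

# Toy two-level index in the same (Year, Month) layout.
index = pd.MultiIndex.from_tuples(
    [(2011, 1), (2011, 2), (2012, 1)], names=["Year", "Month"]
)
sales = pd.DataFrame({"Sales": [150.0, 80.0, 120.0]}, index=index)

# One table per year, keyed by the year itself.
by_year = {
    year: sales.xs(year, level="Year").reset_index()
    for year in sales.index.get_level_values("Year").unique()
}

print(by_year[2011])
```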

### Build a profit table (take each year's profit from the overall data)

In [None]:
Profit=pd.concat([year_2011['Profit'],year_2012['Profit'],year_2013['Profit'],year_2014['Profit']],axis=1)
Profit

### Rename the rows and columns of the profit table

In [None]:
Profit.columns=['2011','2012','2013','2014']
Profit.index=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
Profit
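
A month-by-year profit table can also be produced in one step with `pivot_table`. This is a sketch of an alternative on made-up numbers, not the approach the notebook takes; the `Month_Number` column and the two-month label map are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Year": [2011, 2011, 2012, 2012, 2012],
        "Month_Number": [1, 2, 1, 2, 2],
        "Profit": [10.0, 20.0, 30.0, 40.0, 5.0],
    }
)

# Rows are months, columns are years, cells are summed profit.
profit_table = toy.pivot_table(
    index="Month_Number", columns="Year", values="Profit", aggfunc="sum"
)

# Replace month numbers with short labels (only two months in this toy data).
month_labels = {1: "Jan", 2: "Feb"}
profit_table = profit_table.rename(index=month_labels)

print(profit_table)
```

Because the years come from the data itself, no year can be assigned the wrong slice.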

### Calculate annual profit and display it as a chart

In [None]:
Sum=Profit.sum()
Sum.plot(kind='barh')

### Use a bar chart to show annual sales by region

<font size=3> Because this is a global supermarket with markets in different regions, let's compare sales across regions. The x-axis represents 'Market', the y-axis represents 'Sales', and the legend shows 'Year'. </font>

In [None]:
Market_Year_Sales = df.groupby(['Market', 'Year']).agg({'Sales':'sum'}).reset_index()
Market_Year_Sales.head()

In [None]:
sns.barplot(x='Market',y='Sales',hue='Year',data=Market_Year_Sales)
plt.title('Market_Sales')
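
To read the same numbers as a table rather than a chart, the grouped result can be pivoted so that each market is a row and each year a column. The values below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Market": ["APAC", "APAC", "Canada", "Canada", "APAC"],
        "Year": [2011, 2012, 2011, 2012, 2012],
        "Sales": [300.0, 350.0, 20.0, 25.0, 50.0],
    }
)

# Long format: one row per (Market, Year) pair.
market_year_sales = toy.groupby(["Market", "Year"], as_index=False)["Sales"].sum()

# Wide format: markets as rows, years as columns.
wide = market_year_sales.pivot(index="Market", columns="Year", values="Sales")
print(wide)
```

With matplotlib installed, `wide.plot(kind="bar")` would draw one group of bars per market, similar in spirit to the seaborn chart.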

### Analyze products to find the top 10 by sales volume, sales revenue, and profit

In [None]:
# Number of order lines per product, used here as a proxy for sales volume
productId_count = df.groupby('Product_ID')['Customer_ID'].count().sort_values(ascending=False)
productId_count.head(10)

In [None]:
# Select the column before summing so that date columns are not aggregated
productId_amount = df.groupby('Product_ID')['Sales'].sum().sort_values(ascending=False)
productId_amount.head(10)

In [None]:
productId_Profit = df.groupby('Product_ID')['Profit'].sum().sort_values(ascending=False)
productId_Profit.head(10)
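
The three rankings can share one grouped summary. The sketch below, on made-up products, uses named aggregation and `nlargest`; the output column names `order_lines`, `total_sales`, and `total_profit` are hypothetical, and the toy example takes the top 2 instead of the top 10.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Product_ID": ["A", "A", "B", "C", "C", "C"],
        "Sales": [100.0, 50.0, 500.0, 10.0, 20.0, 30.0],
        "Profit": [10.0, 5.0, -50.0, 4.0, 6.0, 8.0],
    }
)

# One pass over the groups yields all three metrics.
summary = toy.groupby("Product_ID").agg(
    order_lines=("Sales", "count"),
    total_sales=("Sales", "sum"),
    total_profit=("Profit", "sum"),
)

top_by_lines = summary.nlargest(2, "order_lines")
top_by_sales = summary.nlargest(2, "total_sales")
top_by_profit = summary.nlargest(2, "total_profit")

print(top_by_sales)
```

Keeping the metrics side by side also shows when a product with high sales has low or negative profit, as product B does in this toy data.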

### Show the share of each customer type in a pie chart

In [None]:
df["Segment"].value_counts().plot(kind='pie',shadow=True)
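
A pie chart shows proportions but not exact figures. As a small sketch on made-up segment labels, `value_counts(normalize=True)` gives the percentages directly.

```python
import pandas as pd

# Made-up segment labels for five orders.
segments = pd.Series(
    ["Consumer", "Consumer", "Corporate", "Home Office", "Consumer"],
    name="Segment",
)

# Share of each segment as a percentage, largest first.
share = segments.value_counts(normalize=True).mul(100).round(1)
print(share)
```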

## (4) Conclusion

<font size=4> 1. The supermarket's overall annual profit increases year by year, so business seems to be going well. Profits in the second half of each year are also generally higher than in the first half. Could that be due to Singles' Day or Black Friday? (Just kidding.) </font>
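
The half-year comparison can be checked numerically on a month-by-year profit table. The sketch below uses made-up figures in the same layout (months as rows, years as columns); it illustrates the calculation only and does not reproduce the dataset's results.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Made-up monthly profits for two years.
profit = pd.DataFrame(
    {"2011": [1.0] * 6 + [2.0] * 6, "2012": [2.0] * 6 + [3.0] * 6},
    index=months,
)

# Sum January-June and July-December separately for each year.
first_half = profit.iloc[:6].sum()
second_half = profit.iloc[6:].sum()

comparison = pd.DataFrame({"H1": first_half, "H2": second_half})
comparison["H2_higher"] = comparison["H2"] > comparison["H1"]
print(comparison)
```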

<font size=4> 2. Analysis of regional sales shows that Asia-Pacific has the highest sales and Canada the lowest. Our Canadian friends seem reluctant to spend at this supermarket. </font>

<font size=3> There are two possible explanations:

>First, the supermarket may not have enough branches in Canada. Opening a few more stores might raise sales in this region.
    
>Second, the supermarket may not be sufficiently localized, and its local marketing may be incomplete. Pushing localization further and strengthening local consumers' willingness to buy could also drive sales growth. </font>

<font size=4> 3. Looking at the product analysis, we can see that:
>Most of the best-selling items by volume are office supplies.
    
>Most of the items with the highest sales revenue are electronics and furniture, which have higher unit prices.
    
>Half of the top 10 most profitable items are electronics, so increasing sales of these products is a natural way to raise overall profit. </font>

<font size=4> 4. Customer type analysis shows that over these four years, ordinary consumers make up the largest share of customers. The store could stock more everyday goods and focus its marketing on this audience. </font>

## That's all for this article. I believe it gives you enough of a reference to carry out your own data analysis and enjoy the fun that comes with it.

## Enjoy it!