In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# In this session we explore the dataset and draw out the main insights.
# Because the dataset records retail sales, we focus on the aspects of the business that can inform a new course of action.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


retailsheet = pd.read_csv('../input/retaildataset/supermarket_sales - Sheet1.csv')
df = retailsheet                # pd.read_csv already returns a DataFrame, so no extra wrapping is needed.

**We create a DataFrame from the CSV file using pandas (imported as pd). A pandas DataFrame is a structure that holds two-dimensional data together with its row and column labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields because of their rich functionality.**
 
**Note: This exploratory analysis relies on pandas alone. numpy is imported above but is not used directly in this notebook.**

**We start with an overview of the data using the following functions:**

In [None]:
print('Overview of Data \n')

print(df.info())

**The info() method shows the number of rows, the column names, each column's dtype, and the non-null count per column.
If every column's non-null count equals the total number of rows, the dataset has no null values.**
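
The null check described above can also be done directly. The sketch below uses a small hypothetical sample (the column names follow the dataset, the values are made up for illustration) and counts missing values per column with `isna().sum()`.

```python
import pandas as pd

# Hypothetical three-row sample; the real notebook works on the full CSV.
sample = pd.DataFrame({
    "Invoice ID": ["750-67-8428", "226-31-3081", None],
    "City": ["Yangon", None, "Mandalay"],
    "Rating": [9.1, 9.6, 7.4],
})

# Number of missing values in each column.
null_counts = sample.isna().sum()
print(null_counts)

# True if at least one value anywhere in the frame is missing.
has_nulls = bool(sample.isna().any().any())
print(has_nulls)
```

On the real DataFrame, `df.isna().sum()` should give all zeros if the non-null counts from info() match the row count.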

In [None]:
print(df.shape, "This is the size of the DataFrame")
# shape gives the number of rows and columns in a DataFrame.
# It is an attribute that holds a tuple, so you can reuse the values programmatically.

**The shape attribute holds a tuple containing the number of rows and columns.**
**The format is (rows, columns).**
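
Because shape is a plain tuple, it can be unpacked into two variables. A minimal sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample with 3 rows and 2 columns.
sample = pd.DataFrame({"Branch": ["A", "B", "C"], "Quantity": [7, 5, 7]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows x {n_cols} columns")
```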

In [None]:
print(df.describe())

**The describe() method generates descriptive statistics for a DataFrame or one of its columns.
By default it summarizes only the numeric columns; categorical columns (everything else, whether discrete or grouped) are left out unless requested.**

**For example, describe() shows the minimum and maximum unit price, which helps check whether products are sold within the expected price range.
It also shows the average rating across the whole business, 6.97 on a scale of 10, which can be used to monitor quality.**
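
To include the non-numeric columns as well, describe() accepts `include` and `exclude` arguments. The sketch below runs on a hypothetical four-row sample; the values are illustrative only.

```python
import pandas as pd

# Hypothetical sample mixing one text column with two numeric columns.
sample = pd.DataFrame({
    "Branch": ["A", "C", "A", "B"],
    "Unit price": [74.69, 15.28, 46.33, 58.22],
    "Rating": [9.1, 9.6, 7.4, 8.4],
})

# Default: count, mean, std, min, quartiles, max for numeric columns only.
numeric_summary = sample.describe()
print(numeric_summary)

# Excluding numbers leaves the text columns: count, unique, top, freq.
text_summary = sample.describe(exclude="number")
print(text_summary)
```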


# This dataset contains no null or duplicate values, so there is nothing to remove.
# The snippets below still show how to deal with null and duplicate values.


**Removing rows with null values from the entire DataFrame:**
> df = df.dropna()            # dropna(inplace=True) returns None, so its result must not be assigned

**Removing rows with null values in specific columns:**
> df = df.dropna(subset=['City'])
> df = df.dropna(subset=['Invoice ID', 'Payment'])

**Dropping an entire column if even one value is missing:**
> df = df.dropna(axis='columns')
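
The three dropna variants behave differently, which is easiest to see on a small frame. This is a sketch on a hypothetical four-row sample with made-up invoice numbers.

```python
import pandas as pd

# Hypothetical sample with one missing City and one missing Payment.
sample = pd.DataFrame({
    "Invoice ID": ["101", "102", "103", "104"],
    "City": ["Yangon", None, "Mandalay", "Naypyitaw"],
    "Payment": ["Cash", "Ewallet", None, "Credit card"],
})

# Drop every row that has a null in any column.
all_complete = sample.dropna()

# Drop only rows where City is null; a missing Payment is tolerated.
city_known = sample.dropna(subset=["City"])

# Drop every column that contains at least one null.
complete_columns = sample.dropna(axis="columns")

print(len(all_complete), len(city_known), list(complete_columns.columns))
```

Each call returns a new DataFrame and leaves `sample` unchanged, which is why the result is assigned to a name.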

# Checking for Duplicates:
> print(df.duplicated(), " \n This checks whether any rows of the DataFrame are duplicated \n")

**Checking for duplicates in one column:**
> print(df['Invoice ID'].duplicated(), " \n This checks for duplicates in a column \n")   # An Invoice ID must be unique, so checking this one column is enough.

**Checking for duplicates across multiple columns:**
> print(df[['Rating', 'Branch']].duplicated(), "\n This checks for duplicates across multiple columns. \n")

**Checking a limited number of rows:**
> print(df['Rating'][:11].duplicated(), " \n This checks only the first rows \n")
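
Printing the full boolean Series is hard to read on a large dataset, so it is often more useful to count the duplicates. A minimal sketch on a hypothetical sample where one row is repeated:

```python
import pandas as pd

# Hypothetical sample: row 2 is an exact copy of row 1.
sample = pd.DataFrame({
    "Invoice ID": ["101", "102", "102", "103"],
    "Branch": ["A", "B", "B", "A"],
    "Rating": [9.1, 7.4, 7.4, 9.1],
})

# Count fully duplicated rows (the first occurrence is not counted).
n_dup_rows = int(sample.duplicated().sum())

# Count repeated values in a single column.
n_dup_ids = int(sample["Invoice ID"].duplicated().sum())

# Count repeated (Rating, Branch) combinations.
n_dup_pairs = int(sample[["Rating", "Branch"]].duplicated().sum())

# Keep only the first occurrence of each duplicated row.
deduplicated = sample.drop_duplicates()

print(n_dup_rows, n_dup_ids, n_dup_pairs, len(deduplicated))
```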

In [None]:
list(df.columns)  # These are the columns; they help us decide which aspects of the business to evaluate next.
# print(str(list(df.columns)))
for i in enumerate(list(df.columns)):
    print(i)

# Above are the columns of the dataset. To structure the analysis, we will work through the following tasks:
**a) Finding average ratings of all branches and highest among all branches.**

**b) Statistics relevant to the Quantity sold in all branches**

**c) Total amount spent by customers in all the branches.**

**d) Finding who spends more, male or female customers, at the different branches.**

**e) Finding which branch has more female or male visitors.**

**f) Finding which product line is trending within this dataset's date range.**

**g) Finally, a membership analysis that leads to some suggestions.**

**Let's get started with the first task: finding the average rating of each branch and which branch has the highest or lowest average rating.**

In [None]:
BranchRatings = df['Rating'].groupby(df['Branch'])
BranchRatings.mean()

**The average rating of each branch is shown above.**
**Branch "C" has the highest average rating and Branch "B" has the lowest.**
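
The highest and lowest branches can also be read off programmatically instead of by eye. The sketch below uses hypothetical ratings, not the dataset's actual values, and applies `idxmax()` and `idxmin()` to the per-branch means.

```python
import pandas as pd

# Hypothetical ratings, two invoices per branch.
sample = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "C", "C"],
    "Rating": [7.0, 8.0, 6.0, 7.0, 8.0, 9.0],
})

# Mean rating per branch.
mean_rating = sample.groupby("Branch")["Rating"].mean()
print(mean_rating)

# Index labels (branch names) of the highest and lowest mean.
best_branch = mean_rating.idxmax()
worst_branch = mean_rating.idxmin()
print(best_branch, worst_branch)
```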


In [None]:
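Q = df['Quantity'].groupby(df['Date'])
# Sum the units sold on each date; idxmax() would return row labels, not quantities.
DailyQuantity = Q.sum()
print(DailyQuantity.sort_values(ascending=True))

**The output above lists the total quantity sold on each date, sorted from lowest to highest, so the first row is the slowest date and the last row is the peak date.**
**This analysis noted 25th March 2019 as the date with the lowest quantity sold.**
**It noted 18th February 2019 as the peak; both dates are worth confirming against the sorted totals.**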
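
To rank dates by units sold, one option is to parse the dates, sum `Quantity` per date, and then take `idxmax()` and `idxmin()` of the totals. The sketch below uses a hypothetical sample and assumes the `Date` column is in month/day/year form.

```python
import pandas as pd

# Hypothetical sample; dates are assumed to be month/day/year strings.
sample = pd.DataFrame({
    "Date": ["1/5/2019", "3/8/2019", "1/5/2019", "2/18/2019"],
    "Quantity": [7, 5, 8, 2],
})

# Parse the strings so that dates sort chronologically.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# Total units sold on each date.
daily_quantity = sample.groupby("Date")["Quantity"].sum()
print(daily_quantity.sort_values())

# Dates with the highest and lowest totals.
peak_day = daily_quantity.idxmax()
slowest_day = daily_quantity.idxmin()
print(peak_day.date(), slowest_day.date())
```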

In [None]:
Q = df['Quantity'].groupby(df['Branch'])
# new.sum().describe()
print(Q.sum())
print("\n",df['Quantity'].sum(), "These are the total units sold within the date range of Dataset.")

**The quantities sold are nearly equal across the three branches, with no major difference.
Quantity therefore does not separate the branches, so we move on to other measures.**

In [None]:
gross_income = df['gross income'].groupby(df['Branch'])
print(gross_income.sum())

**This is the gross income each branch earned within the date range of this dataset.**
**Branch "C" has the highest gross income, while Branches "A" and "B" are nearly equal.**

In [None]:
Total_amount = df['Total'].groupby(df['Branch'])
print(Total_amount.sum(), "\n")
print(df['Total'].sum(), "is the sum of total amount spent by all customers throughout the business")

**This is the total sales amount of each branch within the date range of this dataset.**
**Again, Branch "C" is the highest, while Branches "A" and "B" are nearly equal in total sales.**

In [None]:
Gender_comp = df['Total'].groupby(df['Gender'])
print(Gender_comp.sum(), "\n")

print(df['Total'].sum(), "is the sum of total amount spent by all customers throughout the business")

**Across the business, female shoppers have spent more in total than male shoppers.**

**Below is a grouped table on the basis of Branches and gender.**

In [None]:
xf = df.groupby(["Branch","Gender"]).sum(numeric_only=True)   # numeric_only=True skips the text columns instead of concatenating them
xf

**From the table above, female customers have spent the most in Branches "A" and "C", while male customers have spent the most in Branch "B".**
**Customer traffic is influenced by many factors, one of which can be how geographically convenient a branch is.**
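
A branch-by-gender comparison is easier to read when the genders sit side by side as columns. The sketch below uses hypothetical totals, not the dataset's figures, and reshapes the grouped sums with `unstack()`.

```python
import pandas as pd

# Hypothetical spending totals, one row per branch and gender.
sample = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "C", "C"],
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Total": [120.0, 100.0, 90.0, 110.0, 150.0, 130.0],
})

# Rows are branches, columns are genders, values are summed totals.
spend = sample.groupby(["Branch", "Gender"])["Total"].sum().unstack("Gender")
print(spend)

# For each branch, the gender with the larger total.
top_spender = spend.idxmax(axis="columns")
print(top_spender)
```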

In [None]:
df['Gender'].value_counts()

In [None]:
e = df.groupby(df['Gender'])
e.count()

**The two tables above give a broader view of the data broken down by gender.**
**For example, 501 invoices were issued to female customers and 499 to male customers.**
**So slightly more invoices went to female customers than to male customers.**

In [None]:
xf = df.groupby(["Branch","Gender"]).count()
xf

**This table gives the number of male and female visitors at each branch.**
**Female customers were invoiced the most in Branch "C".**

In [None]:
Productlin = df['Product line']
Productlin.value_counts()

**Across the whole business, Fashion accessories was the most frequently purchased product line during the dataset's date range.**
**Below, we check this branch by branch.**

In [None]:
Prods = df['Product line'].groupby(df['Branch'])
Prods.value_counts()

**In Branch "A", Home and lifestyle was the leading product line.**
**In Branch "B", Fashion accessories was the leading product line.**
**In Branch "C", Food and beverages was the leading product line.**
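
Rather than scanning the full value_counts output, the most frequent product line per branch can be extracted directly. This sketch uses a hypothetical two-branch sample and takes the mode of each group; if two lines tie, it keeps the first one returned by `mode()`.

```python
import pandas as pd

# Hypothetical sample: three invoices in each of two branches.
sample = pd.DataFrame({
    "Branch": ["A", "A", "A", "B", "B", "B"],
    "Product line": [
        "Home and lifestyle", "Home and lifestyle", "Health and beauty",
        "Fashion accessories", "Food and beverages", "Fashion accessories",
    ],
})

# Most frequent product line within each branch.
top_line = sample.groupby("Branch")["Product line"].agg(lambda s: s.mode().iloc[0])
print(top_line)
```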

In [None]:
fashionb = df['Product line'].groupby([df["Branch"], df["Gender"]])
fashionb.value_counts()

**Here we have further split the product line data by the gender of the customers at each branch.**
**This kind of breakdown by customer attributes can be extended to many other aspects for a more detailed view.**

# Membership Analysis

In [None]:
customer = df['Customer type']
payment = df['Payment']
print(df[(customer == 'Member') & (payment == "Cash")].count())


In [None]:
print(df[(customer == 'Member') & (payment == "Credit card")].count())

In [None]:
df[(customer == 'Member') & (payment == "Ewallet")].count()

**Notably, customers with a membership are most comfortable with credit card as a payment method. The business could partner with the relevant financial institutions to provide more offers to members.**
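
The three filtered counts can be produced in one step with a cross-tabulation of customer type against payment method. The sketch below uses a hypothetical six-row sample, so its counts are illustrative and not the dataset's.

```python
import pandas as pd

# Hypothetical sample of customer types and payment methods.
sample = pd.DataFrame({
    "Customer type": ["Member", "Member", "Member", "Normal", "Normal", "Member"],
    "Payment": ["Cash", "Credit card", "Credit card", "Ewallet", "Cash", "Ewallet"],
})

# Rows are customer types, columns are payment methods, values are invoice counts.
payment_table = pd.crosstab(sample["Customer type"], sample["Payment"])
print(payment_table)

# Payment method used most often by members.
member_counts = payment_table.loc["Member"]
preferred = member_counts.idxmax()
print(preferred)
```

On the real data, `pd.crosstab(df['Customer type'], df['Payment'])` gives the member and non-member counts side by side, which makes the comparison between the two groups explicit.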