### IMPORTANT: Run the below cell once, then go to "Run -> Restart & clear cell outputs" before proceeding further

In [None]:
!pip install pandas==1.1.5
!pip install numpy==1.18.5

from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")


# Extracting Shopper Insights using Python & Pandas

### We shall explore & mine an eCommerce store's sales data to understand how to perform data manipulation, extract findings & patterns from the data using Python, Numpy & Pandas. We shall also showcase results using charting modules - Seaborn & Matplotlib, for conveying our results. 

### The data spans roughly 500,000 (5 lakh) entries across one year, for a UK-based retailer.

### Learners get a walk-through of basic & advanced concepts of Python & a few data-science modules through this dataset, for further use in their own day-to-day projects.


# Part 1: Basics of Python

### Overview-

* Python syntax

* Data types

* Data structures

* Environment variables & working with files

* Control flow & logic

* Error handling

* Scope of variables

* Working directory & searching for files with Glob module

* Installing libraries online & offline

### Python Syntax

* Python code can be run both interactively (in this notebook or via the console prompt) & from scripts (.py files)

In [None]:
# Print text using built-in ```print``` function

print("Hello BIU!")

In [None]:
# Direct evaluation of an expression
# This is a comment: a single-line statement preceded by #
""" Multiple lines enclosed within triple quotes form a multi-line string. When the string is not assigned to a variable
it has no effect, so it is often used as a multi-line comment.
"""

print(673762*62)


# Variable value assignment
a = 10

print(a+a)

### Pandas 

Pandas is a data wrangling module for Python. It treats data in either tabular format(also called a DataFrame) or as a series, while also offering a whole range of functions to help with data manipulation.

In [None]:
import pandas as pd

# Loading a dataset from a csv file and creating a dataframe
# specify encoding to deal with different formats
df = pd.read_csv('../input/ecommerce-data/data.csv', encoding = 'ISO-8859-1')

## Data Set Information:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

## Attribute Information:

* InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* Description: Product (item) name. Nominal.
* Quantity: The quantities of each product (item) per transaction. Numeric.
* InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
* UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* Country: Country name. Nominal, the name of the country where each customer resides.


In [None]:
# head() is used to read top 5 rows and tail() is used to read bottom 5 rows of a dataframe
# The value can be changed --> head(10)

df.head(3)

In [None]:
# Accessing specific columns and rows of a dataframe

#There are 3 ways to access --> using loc, iloc and list

# Using loc
print('Using loc: \n', df.loc[:, ['InvoiceNo', 'Description']].head())    # rows,  columns

# Using iloc --> Using index
print('\n Using iloc: \n', df.iloc[:, [0,1,2]].head())    # rows,  columns

# Using list of columns
print('\n Using list of columns: \n', df[['InvoiceNo', 'Description']].head())    # rows,  columns
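The difference between `loc` (label-based) and `iloc` (position-based) is easier to see when the index is not the default 0, 1, 2. Below is a minimal sketch on a made-up three-row frame; the `demo` name, its values and its index are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame with a non-default index (10, 20, 30)
demo = pd.DataFrame(
    {"InvoiceNo": ["536365", "536366", "536367"], "Quantity": [6, 8, 2]},
    index=[10, 20, 30],
)

by_label = demo.loc[20, "Quantity"]   # row with index label 20
by_position = demo.iloc[1, 1]         # second row, second column

label_slice = demo.loc[10:20]         # label slices include the end label
position_slice = demo.iloc[0:1]       # position slices exclude the end position

print(by_label, by_position, len(label_slice), len(position_slice))
```

Both lookups reach the same cell, but the two slices differ in length because only the label slice includes its end point.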

In [None]:
# Accessing individual columns, it can be used if column names do not contain spaces

df.InvoiceNo.head()

In [None]:
# Functions

# Built-in functions that are part of Python

print("hello")
print(len("63726372"))

# User defined functions, are created using the def(define) keyword, you can pass multiple arguments to them as well
# To invoke them write the function name & choose to either pass an argument if it accepts one or simply call with parentheses/()

def double_quantity(x):
    '''
    This is a docstring (documentation string); ordinary comments use # instead
    x: int
    It takes x as input and returns 2 times x
    '''
    return x * 2


df['Double_quantity'] = df.Quantity.apply(double_quantity)
df.head()

### Data Types

In [None]:
# We can check the datatype of a variable using the type function; here we check the datatype of a value in a column
# Python automatically assigns a datatype to a variable

# Integer datatype
print(type(df.Quantity[0])) # numpy.int64; 64 is the size in bits

# Float --> Used to store decimal values
print(type(df.UnitPrice[0]))

# String
print(type(df.Description[0]))

# Boolean --> True(1) or False(0)
print(type(df.InvoiceNo[0] == 536365))

### Data Structures

#### Commonly used data structures are Lists, Tuples, Sets & Dictionaries

In [None]:
# Lists are denoted by [] & can contain the same or different data types within them-
# Lists are mutable --> the values can be modified
# We want to list out all the countries available in our dataset

# List of all countries
countries_list = list(df.Country.unique())
print(countries_list)

In [None]:
# List operations

# Remove an element from a list, we want to exclude a country from our analysis

countries_list.remove('United Kingdom') # It updates the existing list countries_list
print("After removing UK: \n", countries_list, "\n") # \n is a newline and \t is a tab; these are called escape sequences
print(countries_list.remove('France')) # remove() modifies the list in place and returns None, so this prints None
print("After removing France: \n", countries_list, "\n")


# Append an element to the list
# Creating a list
removed_countries = ['United Kingdom', 'France']
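countries_list.append(removed_countries) # append() adds the whole list as ONE nested element
countries_list = countries_list + removed_countries # + concatenates, adding each element individually
print("After adding removed countries: \n", countries_list,  "\n")
countries_list.remove(removed_countries) # Remove the nested list again so that sort() below compares only strings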

# Sorting a list
countries_list.sort(reverse=False) # reverse True-->Descending
print("After sorting: \n", countries_list)
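The difference between adding a list as a single element and adding its items one by one can be sketched in isolation. The country names below are just sample values, and `extend` is used here as the in-place alternative to `+`.

```python
countries = ["Germany", "Spain"]
extra = ["France", "EIRE"]

nested = countries.copy()
nested.append(extra)   # the whole list becomes one nested element

flat = countries.copy()
flat.extend(extra)     # each element is added individually, in place

print(nested)
print(flat)
```

A nested list inside a list of strings makes `sort()` fail with a `TypeError`, which is why flat lists are usually what you want here.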

In [None]:
# Tuples
# A tuple consists of a number of values separated by commas & can be defined with or without parentheses
# They are immutable --> the values cannot be modified

countries_tuple = tuple(df.Country.unique())
print(countries_tuple) # Notice round bracket

countries_tuple[0:5] # Indexing in Python starts from 0 and the end of a slice is exclusive, so [0:5] returns positions 0,1,2,3,4

In [None]:
# Sets
# Sets are an unordered collection with no duplicate elements

country_list = list(df.Country)
print(len(country_list), "\n")

# Convert to set --> It will only keep unique entries
country_set = set(country_list)
print(len(country_set), "\n")
print(country_set, "\n") # Notice {} braces

# Basic uses include membership testing and eliminating duplicate entries. 
# Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

# Check if India is present in set
print("India" in country_set, "\n")

countries_set_2 = {'Hong Kong', 'Iceland', 'European Community'}
india = {'India'}

print(country_set - countries_set_2, "\n") # Countries in country_set but not in countries_set_2
print(country_set & countries_set_2, "\n") # Countries in country_set and countries_set_2 (intersection)
print(country_set.union(india), "\n") # Countries in country_set or in india (union)

In [None]:
# Dictionary
# Dictionary is a set of "key: value" pairs, with the requirement that the keys are unique (within one dictionary). 
# A pair of braces creates an empty dictionary: {}

# We want to select the top 6 countries on the basis of their occurrence in the dataset
country_dict = dict(df.Country.value_counts()[:6])
print(country_dict)
print("Key: ", country_dict.keys())
print("Values: ", country_dict.values(), "\n")


print(country_dict['Germany']) # Extract values using the key

# Operations like deleting a key-value pair is also possible using ```del```
del country_dict["France"]
print(country_dict)

india_dict = {'India' : 757574734}
country_dict.update(india_dict)
print(country_dict)
# For more operations on dictionaries refer to https://docs.python.org/3/tutorial/datastructures.html#dictionaries

### Environment variables & working with files

In [None]:
# Quite often you will end up working with libraries or databases where you might need-
# values from the environment directly for functionality or security of passphrase/keys/port numbers

import os
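# Note: a shell command such as `!set` or `!export` runs in a separate subprocess, so it cannot change this kernel's environment
os.environ["USERNAME"] = "akash"   # set the variable for the current Python process
print("$username", os.getenv("USERNAME", "dev"))   # "dev" is the fallback value if the variable is not set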


In [None]:
# Writing to a file
with open("test.txt",'w') as f:
   f.write("my first file\n")   # \n denotes newline character
   f.write("This file\n")
   f.write("contains three lines\n")


!ls  # In the notebook commands prefixed with '!' are run by the host os, ```ls``` command lists all files & folders in current directory
# Output from ls shows that the file was created

In [None]:
# Reading a file
with open("test.txt",'rb') as f:
    # 'rb' opens the file in binary mode, so read() returns bytes (shown with a b'...' prefix)
    print(f.read())


### File modes-
#### There are multiple modes with which a file can be interacted with-

* 'r': The default mode. Opens the file for reading.
* 'w': Opens the file for writing. If the file does not exist, it creates a new file. If the file exists, it truncates the file.
* 'a': Opens the file in append mode. If the file does not exist, it creates a new file.
* 'b': Opens the file in binary mode (combined with another mode, for example 'rb').
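The modes listed above can be tried out safely in a temporary directory. This is a small sketch; the file name `modes_demo.txt` is an arbitrary choice for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")

    with open(path, "w") as f:    # 'w' creates the file
        f.write("first line\n")
    with open(path, "a") as f:    # 'a' adds to the end
        f.write("second line\n")
    with open(path, "r") as f:    # 'r' reads text
        text = f.read()
    with open(path, "rb") as f:   # 'rb' reads raw bytes
        raw = f.read()
    with open(path, "w") as f:    # 'w' on an existing file truncates it first
        f.write("replaced\n")
    with open(path) as f:         # default mode is 'r'
        after_truncate = f.read()

print(repr(text), type(raw).__name__, repr(after_truncate))
```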

### Control flow & logic

#### It is essential to be able to control the behaviour of your code, you may want something to run 100 times given a condition or iterate through a list etc. Control flow with loops & conditional logic help you do the same.

In [None]:
# A 'for loop' runs a block of code once for every item in an iterable
# The iterable could be a range of values, a list, a dictionary or any other sequence

# Loop for 5 times-
for i in range(5):   # range starts from 0 until the number specified, hence in this case- 0,1,2,3,4
    print(i)
print('\n')    
    
# Iterating a list
for index,country in enumerate(removed_countries):
    print(index,country)
print('\n')    

# Enumerating a dictionary
for key,value in country_dict.items():
    print(key, "--", value)

In [None]:
# If Else conditions are standard ways of controlling whether to do something if an expression/variable is true otherwise do something else

# We want to find list of countries having more than 1000 entries
for key, value in dict(df.Country.value_counts()).items():
    
    if value > 1000:
        print(key, "--",value)
        
        
print('\nClassifying countries based on count:')
        
# We want to classify the 11 most frequent countries by their number of entries
for key, value in dict(df.Country.value_counts()[:11]).items():
    
    if value > 10000:
        print(key, "-- More than 10k count")
    elif value <=10000 and value >= 2000:
        print(key, "-- Between 2k and 10k")
    else:
        print(key, "-- Less than 2k")
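The same if/elif/else logic is easier to reuse and test when wrapped in a function. In the sketch below, `classify_count` and the sample counts are hypothetical and are not taken from the dataset.

```python
def classify_count(value):
    """Return a size label for a number of entries."""
    if value > 10000:
        return "More than 10k"
    elif value >= 2000:   # reached only when value <= 10000
        return "Between 2k and 10k"
    else:
        return "Less than 2k"

# Made-up counts for three placeholder countries
sample_counts = {"Country A": 12000, "Country B": 5000, "Country C": 150}
labels = {name: classify_count(count) for name, count in sample_counts.items()}
print(labels)
```

Because each `elif` is only checked when the earlier conditions were false, the upper bound does not need to be repeated.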

In [None]:
# 'While loop' is used for repeatedly running code as long as an expression/variable is true
i = 0
while i<3:
    print(df.iloc[i, [1,2,3]])
    i += 1

### Break/Continue statement

In [None]:
# Break statement terminates the nearest enclosing loop, skipping the optional else clause if the loop has one.
# Continue statement continues with the next cycle of the nearest enclosing loop.

i = 0
while True:
    temp = df.iloc[i, [1,2,3]]
    print(temp)
    
    if temp.Quantity <= 6:
        print("Quantity less than or equal to 6")
        i += 1
        continue
        
    if temp.Quantity > 6:
        print("Quantity greater than 6")
        i += 1
        break
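The interaction of `continue`, `break` and the optional loop `else` clause can be sketched on a plain list. The quantities here are made-up values, not rows from the dataframe.

```python
quantities = [6, 6, 8, 2]   # hypothetical order quantities
small = []
first_large = None
broke_out = None

for q in quantities:
    if q <= 6:
        small.append(q)
        continue            # skip the rest of the body, go to the next item
    first_large = q
    broke_out = True
    break                   # leave the loop; the else clause is skipped
else:
    broke_out = False       # runs only when the loop ends without a break

print(small, first_large, broke_out)
```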
    

In [None]:
# We are trying to access a few columns from a dataframe, but it throws an error because the column name 'quantity' is incorrect; the correct name is "Quantity"
df.loc[:3, ['InvoiceNo', 'StockCode', 'quantity']]

In [None]:
df.columns

In [None]:
# We can use error handling to avoid this and print appropriate message to the user

try:
    print(df.loc[:3, ['InvoiceNo', 'StockCode', 'quantity']])
except Exception as e:   # You can also handle specific exceptions separately, e.g. except KeyError:
    print("The specified column is not available in the dataframe.")
    print("Please use a capital letter for the 1st character of the column name or check if the column exists in the dataframe.")

In [None]:
print(demo_var)
print("This should not print")

In [None]:
# We haven't initialized demo_var with any value, so the below raises a NameError, which we catch
try:
  print(demo_var)
except Exception as e:   # You can also handle multiple exceptions differently by specifying it like-> except NameError:
  print("An exception occurred",e)

print("This should print")
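Catching a specific exception type, rather than the generic `Exception`, keeps unrelated errors visible. A minimal sketch with a plain dictionary standing in for a row of data; `get_column` and `row` are hypothetical names used only here.

```python
row = {"InvoiceNo": "536365", "Quantity": 6}   # made-up record

def get_column(record, name):
    """Look up a column, returning a readable message if it is missing."""
    try:
        value = record[name]
    except KeyError as exc:          # only missing keys are handled here
        return f"Missing column: {exc}"
    else:                            # runs when no exception was raised
        return value

print(get_column(row, "Quantity"))
print(get_column(row, "quantity"))
```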

### Scope of variables

In [None]:
# All variables have a scope within which they can be modified; to reassign a global variable inside a function you need the global keyword

# This would work since no modification takes place

x = 10
def bar():
    print(x)
bar()

# but this code will error out (UnboundLocalError): assigning to x makes it local to bar(), so print(x) runs before x has a value
x = 10
def bar():
    print(x)
    x = x + 1
    print(x)
bar()

In [None]:
# To solve the above issue we use `global` keyword which allows us to modify global variables

x = 10
def bar():
    global x
    print(x)
    x = x + 1
    print(x)
bar()
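An alternative to `global` that is often easier to follow is to pass the value in and return the result. The sketch below first reproduces the scoping error and then shows that alternative; the function names are made up for illustration.

```python
counter = 10

def broken_increment():
    # The assignment below makes counter local to this function,
    # so reading it on the next line fails before it has a value
    new_value = counter + 1
    counter = new_value
    return counter

try:
    broken_increment()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

def increment(value):
    """Return the incremented value instead of modifying a global."""
    return value + 1

counter = increment(counter)
print(error_name, counter)
```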

### Working directory & searching for files with Glob module

In [None]:
# Sometimes you might want to change your working directory to another folder in order to access files
# You can do this with `os` module chdir() & find out your current working directory using getcwd()

# Check current working directory.
import os
retval = os.getcwd()

# This is a way of printing text & variables using f-strings, with the `f` prefix & braces {} around the variable/expression
print(f"Current working directory- {retval}")  

path = "/kaggle/input"

# Now change the directory
os.chdir( path )

# Check current working directory.
retval = os.getcwd()
print(f"Directory changed successfully- {retval}")

# Revert back using old path
path = "/kaggle/working"
os.chdir( path )
retval = os.getcwd()
print(f"Directory changed back successfully- {retval}")

In [None]:
# Finding files - You may want to search for all files that are of a certain extension, you can do that with `glob` module
# We create mock-files using linux's `touch` command
!touch 1.csv 2.csv 3.csv

import glob

list_of_csv = glob.glob('./*.csv')

print(f"{list_of_csv}")
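The same search works without shell commands, which keeps the example portable across operating systems. This sketch creates a few empty files in a temporary directory (the file names are arbitrary) and matches only the CSV files.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty mock files, the Python equivalent of `touch`
    for name in ["1.csv", "2.csv", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()

    pattern = os.path.join(tmp_dir, "*.csv")   # * matches any file name
    csv_files = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(csv_files)
```

`glob.glob` returns matches in arbitrary order, so the result is sorted before use.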

### Installing libraries online & offline

In [None]:
# You can install python modules using pip(recursive acronym of "Pip Installs Packages")
# !pip install pandas

# In certain environments without internet connectivity you can download packages from pypi & install them offline
# Ex. Flask web-server from https://pypi.org/project/Flask/#files

!wget https://files.pythonhosted.org/packages/f2/28/2a03252dfb9ebf377f40fba6a7841b47083260bf8bd8e737b0c6952df83f/Flask-1.1.2-py2.py3-none-any.whl
    
!pip install Flask-1.1.2-py2.py3-none-any.whl

## Kaggle Platform Additional Info

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Part 2: Basic Data Science with Python using Numpy & Pandas

### Overview-
* Loading the dataset

* Numpy & Pandas basics

* Exploratory Data Analysis on dataset

* Results

# Context of Dataset
Company - UK-based and registered non-store online retail

Products for selling - Mainly all-occasion gifts

Customers - Most are wholesalers (local or international)

Transactions Period - **1st Dec 2010 - 9th Dec 2011 (One year)**

* #  *Loading the dataset*

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

import warnings
# current version of seaborn generates a bunch of warnings that we'll ignore
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

import missingno as msno # missing data visualization module for Python
import pandas_profiling

import gc
import datetime

%matplotlib inline
color = sns.color_palette()

In [None]:
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 100)

In [None]:
# specify encoding to deal with different formats
df = pd.read_csv('../input/ecommerce-data/data.csv', encoding = 'ISO-8859-1')

In [None]:
df.head()

In [None]:
# change the column names
df.rename(index=str, columns={'InvoiceNo': 'invoice_num',
                              'StockCode' : 'stock_code',
                              'Description' : 'description',
                              'Quantity' : 'quantity',
                              'InvoiceDate' : 'invoice_date',
                              'UnitPrice' : 'unit_price',
                              'CustomerID' : 'cust_id',
                              'Country' : 'country'}, inplace=True)

In [None]:
df.head()

* # *Numpy basics*

In [None]:
# Convert pandas to numpy array

arr = np.array(df.iloc[:10, [3,5]])
arr

### Basic array operations

In [None]:
arr.shape

In [None]:
np_reshape_arr = arr.reshape(2, 10)
print("Shape of array after reshaping: ", np_reshape_arr.shape)
np_reshape_arr

In [None]:
# Concatenate along the axis 
# 0 --> running vertically along rows

np_concat_0 = np.concatenate((arr, arr), axis=0)
print("Shape after concatenating along axis 0: ", np_concat_0.shape)

# 1 --> running horizontally along columns
np_concat_1 = np.concatenate((arr, arr), axis=1)
print("Shape after concatenating along axis 1: ", np_concat_1.shape)
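How the axis argument changes the resulting shape can be checked on a tiny array. The array `a` below is a made-up 3x2 example, not the slice taken from the dataframe.

```python
import numpy as np

a = np.arange(6).reshape(3, 2)             # 3 rows, 2 columns

stacked_rows = np.concatenate((a, a), axis=0)   # rows are added
stacked_cols = np.concatenate((a, a), axis=1)   # columns are added

print(stacked_rows.shape, stacked_cols.shape)
```

Only the dimension named by `axis` grows; the other dimension must already match between the arrays.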

In [None]:
# Create an array of 0s; this is generally used when we have to initialize a parameter.

np_arr_zeros = np.zeros((10, 2))
print("Numpy array of zeros: \n", np_arr_zeros)

# Create an array of 1s; this is generally used when we have to initialize a parameter.

np_arr_ones = np.ones((10, 2))
print("Numpy array of ones \n", np_arr_ones)

In [None]:
# Transposing an array & Matrix multiplication

print("arr shape: ", arr.shape)
print("np_arr_ones shape: ", np_arr_ones.shape)
print("np_arr_ones shape after transposing: ", np_arr_ones.T.shape)

np.matmul(arr, np_arr_ones.T)

In [None]:
# Stats operations using numpy

# Mean
print("Mean of the entire array:", np.mean(arr))

# These operations can be applied along an axis
print("Mean of the array along axis 0 (vertically):", np.mean(arr, axis=0))
print("Mean of the array along axis 1 (horizontally):", np.mean(arr, axis=1))
      
print("Standard deviation of the entire array:", np.std(arr))
print("Min of the entire array:", np.min(arr))
print("Max of the entire array:", np.max(arr))

In [None]:
# Sorting an array

# The default axis is -1, the last axis (axis 1, horizontal, for a 2-D array). sort() sorts the array in place and returns None, so assigning its result to a variable stores None
print(hex(id(arr)))
arr.sort(axis=1)
print("Sorted array:\n", arr)
print(hex(id(arr)))

arr_sorted_0 = arr.sort(axis=0)
print("Trying to assign sorted array to a variable:", arr_sorted_0)
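When a sorted copy is needed and the original must stay untouched, `np.sort` can be used instead of the in-place method. A small sketch on a made-up 2x2 array:

```python
import numpy as np

a = np.array([[3, 1], [2, 4]])

sorted_copy = np.sort(a, axis=1)   # returns a new array, a is unchanged
result = a.sort(axis=0)            # sorts a in place and returns None

print(sorted_copy)
print(a)
print(result)
```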


In [None]:
# Conditions on array  -> This can be used in pandas to create new variables using a condition on existing variables

# Identifying values greater than 6 in array and assigning them 1
np.where(arr > 6, 1, 0)
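Used with pandas, the same `np.where` call creates a flag column from a condition on an existing column. In this sketch the `demo` frame and the `bulk_order` column name are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"quantity": [6, 8, 2, 12]})   # made-up quantities

# 1 where the quantity is greater than 6, otherwise 0
demo["bulk_order"] = np.where(demo["quantity"] > 6, 1, 0)

print(demo)
```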

In [None]:
# Assigning seed and generating random numbers

# Setting a seed makes the random functions generate the same output on every run; change the seed and try again.
# Random numbers are also used for initializing parameters
np.random.seed(1234)
print("Array of random numbers:\n", np.random.rand(2,3))
print("Integer array of random numbers from 50 to 99 (high is exclusive):\n", np.random.randint(low=50, high=100, size=10))

* # Pandas basics

Pandas is a data wrangling module for Python. It treats data in either tabular format(also called a DataFrame) or as a series, while also offering a whole range of functions to help with data manipulation.


In [None]:
# Importing pandas & numpy
import pandas as pd
import numpy as np

In [None]:
#Creating a series from a dataframe
# Series represents a single column of a dataframe
df_invoice_series = df.invoice_num
type(df_invoice_series)

In [None]:
# How to infer data type of a dataframe
# df_1.dtypes
df.dtypes

# Data manipulation- There are two ways in which you can perform manipulation to a Pandas object.
## 1. User defined functions, inline/anonymous functions called lambdas can also be used. More info at https://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.apply.html
## 2. Pandas built-in functions
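The first approach, a user-defined function or a lambda passed to `apply`, can be sketched as follows. The `demo` frame is a made-up stand-in for the quantity column, and the vectorized form is included for comparison because it is usually faster on large columns.

```python
import pandas as pd

demo = pd.DataFrame({"quantity": [6, 8, 2]})   # hypothetical values

def double_quantity(x):
    """Return twice the input value."""
    return x * 2

via_function = demo["quantity"].apply(double_quantity)
via_lambda = demo["quantity"].apply(lambda x: x * 2)
via_vectorized = demo["quantity"] * 2   # no Python-level loop over the rows

print(via_vectorized.tolist())
```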

In [None]:
# 2. Pandas built-in functions
# Pandas has a lot of statistical & linear-algebra functions ex. mean, median, sum, value_counts, rank, quantile
# More at https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

print("Mean unit price")
print(df.unit_price.mean())

print("\nTotal unit price")
print(df.unit_price.sum())

In [None]:
# Checking the missing values in all the columns

print("Missing values in all the columns:")
print(df.isna().sum())

# Replace missing values with a default value
df.description.fillna('No description', inplace=True) # inplace=True modifies the column in the same dataframe and won't return a new dataframe
print("\nMissing values in all the columns after replacement:")
print(df.isna().sum())

# Replace missing values with a mean value (shown only to illustrate the technique: the mean of an ID column has no real meaning)
df.cust_id.fillna(np.mean(df.cust_id), inplace=True) 
print("\nMissing values in all the columns after replacement:")
print(df.isna().sum())

In [None]:
# Working with multiple dataframes using `merge`
# The data is generally available in multiple tables or excel sheet and we need to join multiple tables to create new variables
# The merge function helps in this task

print('Base data\n', df.head())

# Creating a new dataframe using a dictionary
# Creating a list of unique countries 
country_list = df.loc[:, 'country'].unique()
population = {}
for country in country_list:
    # Assigning a random value between 1000 - 10000000
    population[country] = np.random.randint(1000, 10000000)
    
population_df = pd.DataFrame.from_dict(population, orient='index', columns=["Population"]).reset_index() #orient specifies how the keys should be aligned either as columns or index
population_df.columns = ['country', 'Population']
print('\nPopulation data\n', population_df.head())


# Merging 2 dataframes based on common column
df = pd.merge(df, population_df, on='country', how='left') # Check out more samples on how to merge, ex. left/right etc
df.head()
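The effect of the `how` argument is easiest to see on two tiny frames where one key has no match. The frames and numbers below are made up; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

orders = pd.DataFrame({"country": ["France", "Spain", "Iceland"],
                       "quantity": [6, 8, 2]})
population = pd.DataFrame({"country": ["France", "Spain"],
                           "Population": [100, 200]})   # no row for Iceland

# left: keep every order, fill missing population with NaN
left = pd.merge(orders, population, on="country", how="left", indicator=True)
# inner: keep only countries present in both frames
inner = pd.merge(orders, population, on="country", how="inner")

print(left)
print(inner)
```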

In [None]:
# Grouping values in a dataframe

# A groupby operation involves some combination of splitting the object, applying a function, and combining the results. 
# This can be used to group large amounts of data and compute operations on these groups.
print(df.groupby(['invoice_num','cust_id']).sum().head(10))
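The split-apply-combine idea can be sketched on a small frame, using named aggregation to compute a different statistic per column. The data and the output column names (`total_quantity`, `mean_price`) are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["France", "France", "Spain"],
    "quantity": [6, 8, 2],
    "unit_price": [2.5, 1.0, 4.0],
})

# One row per country: total quantity and average unit price
summary = demo.groupby("country").agg(
    total_quantity=("quantity", "sum"),
    mean_price=("unit_price", "mean"),
)

print(summary)
```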


In [None]:
# Accessing specific rows/columns of data using `loc` & `iloc`

# Let's say you want to access all the rows which have orders from the United Kingdom; you can use `loc` & provide a condition to filter such rows
print(df.loc[df['country'] == 'United Kingdom'])

# In case you need to access a subset of rows by their position instead of column values, you can do so with `iloc`
print(df.iloc[7:10])

In [None]:
# Saving a dataframe
# Pandas allows you to save a dataframe in multiple formats like csv, excel, pickle, sql, JSON etc

df.to_csv('output.csv') # You can customize aspects like headers or indexes to keep as well

# Check out more samples at https://pandas.pydata.org/pandas-docs/version/0.18.1/api.html#serialization-io-conversion

# Data Cleaning 

In [None]:
df.info()

## Check missing values for each column 

In [None]:
# check missing values for each column 
df.isnull().sum().sort_values(ascending=False)

In [None]:
# check out the rows with missing values
df[df.isnull().any(axis=1)].head()

In [None]:
# change the invoice_date format - String to Timestamp format
df['invoice_date'] = pd.to_datetime(df.invoice_date, format='%m/%d/%Y %H:%M')

In [None]:
# change description - UPPER case to LOWER case
df['description'] = df.description.str.lower()

In [None]:
df.head()

## Remove rows with missing values

In [None]:
# df_new without missing values
df_new = df.dropna()

In [None]:
# check missing values for each column 
df_new.isnull().sum().sort_values(ascending=False)

In [None]:
df_new.info()

In [None]:
# change column type - Float to Int type
df_new['cust_id'] = df_new['cust_id'].astype('int64')

In [None]:
df_new.head()

In [None]:
df_new.info()

In [None]:
df_new.describe().round(2)

* ## Remove Quantity with negative values

In [None]:
df_new = df_new[df_new.quantity > 0]
df_new = df_new[df_new.unit_price >= 0]

In [None]:
df_new.describe().round(2)

## Add the column - amount_spent

In [None]:
df_new['amount_spent'] = df_new['quantity'] * df_new['unit_price']

In [None]:
# rearrange all the columns for easy reference
df_new = df_new[['invoice_num','invoice_date','stock_code','description','quantity','unit_price','amount_spent','cust_id','country']]

## Add the columns - Month, Day and Hour for the invoice

In [None]:
df_new.insert(loc=2, column='year_month', value=df_new['invoice_date'].map(lambda x: 100*x.year + x.month))
df_new.insert(loc=3, column='month', value=df_new.invoice_date.dt.month)
# +1 to make Monday=1.....until Sunday=7
df_new.insert(loc=4, column='day', value=(df_new.invoice_date.dt.dayofweek)+1)
df_new.insert(loc=5, column='hour', value=df_new.invoice_date.dt.hour)
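The date parts used above can be checked on two hand-picked timestamps. This sketch parses strings in the same month/day/year format and derives the same fields with the `.dt` accessor; the two sample dates are chosen for illustration.

```python
import pandas as pd

raw = pd.Series(["12/1/2010 8:26", "12/9/2011 12:50"])
dates = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

year_month = dates.dt.year * 100 + dates.dt.month   # e.g. 201012
day = dates.dt.dayofweek + 1                        # Monday=1 ... Sunday=7
hour = dates.dt.hour

print(year_month.tolist(), day.tolist(), hour.tolist())
```

The vectorized `year * 100 + month` expression gives the same result as mapping a lambda over each timestamp.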

In [None]:
df_new.head()

# Hands on exercises-

## The dataframe ```df_new``` contains all the columns that are needed for the below exercises


1. Which customer places the most orders & which country are they from?
  Also plot the number of orders for the top 5 customers

2. Which customers spent the most? Also plot the money spent for top 5 customers

3. Plot how many orders per month/day/hour of the day

4. Certain items are priced at 0, as free gifts. Try out different chart types (scatterplot, boxplot etc.) to see the distribution of unit prices & also plot how often free items are given away in each month

5. Find out which country places the most orders & plot the same

6. Find out how much each country spent & plot the same; also try removing the top country & then plotting again to see the updated data