# Extracting Shopper Insights using Python & Pandas

### We shall explore & mine an eCommerce store's sales data to learn how to manipulate data and extract findings & patterns from it using Python, NumPy & Pandas. We shall also present the results using the charting modules Seaborn & Matplotlib.

### The data spans roughly 5 lakh (about 500,000) entries across a year, for a UK-based retailer.

### Learners get a walk-through of basic & advanced concepts of Python & a few data-science modules through this dataset, for further use in their own day-to-day projects.


# Part 1: Basics of Python

### Overview-

* Python syntax

* Data types

* Data structures

* Environment variables & working with files

* Control flow & logic

* Error handling

* Scope of variables

* Working directory & searching for files with Glob module

* Installing libraries online & offline

### Python Syntax

* Python code can be run both interactively (in this notebook or via the console prompt) & using scripts (.py)

In [None]:
# Print text using built-in ```print``` function

print("Hello BIU!")

In [None]:
# Direct evaluation of an expression
# This is a comment: a single-line statement preceded by #
""" Multiple lines enclosed within triple quotes form a multi-line string. If the string is not assigned
to a variable it has no effect when executed, so it is often used like a comment.
"""

print(673762*62)


# Variable value assignment
a = 10

print(a+a)

### Pandas 

Pandas is a data wrangling module for Python. It treats data in either tabular format(also called a DataFrame) or as a series, while also offering a whole range of functions to help with data manipulation.
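To make the DataFrame/Series distinction concrete, here is a minimal sketch. The `sales` table and its values are made up for illustration (only loosely modelled on the retailer's columns) and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical mini-dataset, for illustration only
sales = pd.DataFrame({
    "Description": ["MUG", "LANTERN", "CANDLE"],
    "Quantity": [6, 2, 12],
    "UnitPrice": [2.55, 3.39, 1.25],
})

# Selecting one column gives a Series; selecting a list of columns gives a DataFrame
quantity = sales["Quantity"]
subset = sales[["Description", "Quantity"]]

print(type(quantity).__name__)
print(type(subset).__name__)
print(sales.shape)   # (number of rows, number of columns)
```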

In [None]:
import pandas as pd

# Loading a dataset from a csv file and creating a dataframe
# specify encoding to deal with different formats
df = pd.read_csv('../input/ecommerce-data/data.csv', encoding = 'ISO-8859-1')

## Data Set Information:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

## Attribute Information:

* InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
* StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* Description: Product (item) name. Nominal.
* Quantity: The quantities of each product (item) per transaction. Numeric.
* InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
* UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* Country: Country name. Nominal, the name of the country where each customer resides.
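The attribute notes above can be turned into simple derived columns. The sketch below uses a small made-up `orders` table rather than the real file; the column names `IsCancelled` and `LineTotal` are hypothetical helpers introduced only for this example. The prefix check is case-insensitive, on the assumption that the cancellation letter may appear in either case.

```python
import pandas as pd

# Hypothetical rows, for illustration only
orders = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381"],
    "Quantity": [6, -1, 4],
    "UnitPrice": [2.55, 27.50, 1.25],
})

# InvoiceNo is kept as text because cancellations carry a letter prefix
orders["IsCancelled"] = orders["InvoiceNo"].str.upper().str.startswith("C")

# Value of each line item = quantity x unit price
orders["LineTotal"] = orders["Quantity"] * orders["UnitPrice"]

print(orders)
print("Cancelled rows:", int(orders["IsCancelled"].sum()))
```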


In [None]:
# head() is used to read top 5 rows and tail() is used to read bottom 5 rows of a dataframe
# The value can be changed --> head(10)

df.head()

In [None]:
# Accessing specific columns and rows of a dataframe

#There are 3 ways to access --> using loc, iloc and list

# Using loc
print('Using loc: \n', df.loc[:, ['InvoiceNo', 'Description']].head())    # rows,  columns

# Using iloc --> Using index
print('\n Using iloc: \n', df.iloc[:, [0,2]].head())    # rows,  columns

# Using list of columns
print('\n Using list of columns: \n', df[['InvoiceNo', 'Description']].head())    # rows,  columns

In [None]:
# Accessing individual columns, it can be used if column names do not contain spaces

df.InvoiceNo.head()
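One difference between `loc` and `iloc` that often causes confusion is how they slice rows: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end. The sketch below uses a made-up `frame`; the column name `Unit Price` (with a space) is an assumption chosen to show when dot notation cannot be used.

```python
import pandas as pd

# Hypothetical data, for illustration only
frame = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Unit Price": [2.55, 1.85, 1.69, 4.25],
})

by_label = frame.loc[0:2, "InvoiceNo"]   # labels 0, 1 and 2
by_position = frame.iloc[0:2, 0]         # positions 0 and 1 only

print(len(by_label), len(by_position))

# A column name containing a space cannot be reached with dot notation,
# so bracket notation is used instead
print(frame["Unit Price"].head(2))
```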

In [None]:
# Functions

# Built-in functions that are part of Python

print("hello")
print(len("63726372"))

# User defined functions, are created using the def(define) keyword, you can pass multiple arguments to them as well
# To invoke them write the function name & choose to either pass an argument if it accepts one or simply call with parentheses/()

def double_quantity(x):
    '''
    This is a docstring (documentation string); other comments use #
    x: int
    Takes x as input and returns 2 times x
    '''
    return x * 2


df['Double_quantity'] = df.Quantity.apply(double_quantity)
df.head()
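For simple arithmetic such as doubling, Pandas can also operate on the whole column at once, which is usually faster than calling a Python function per row with `apply`. A minimal sketch on a made-up `basket` table, showing that both approaches give the same result:

```python
import pandas as pd

# Hypothetical data, for illustration only
basket = pd.DataFrame({"Quantity": [6, 8, 2]})

def double_quantity(x):
    """Return twice the input value."""
    return x * 2

via_apply = basket["Quantity"].apply(double_quantity)   # one function call per row
via_vectorised = basket["Quantity"] * 2                 # one operation on the whole column

print(via_apply.equals(via_vectorised))
```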

### Data Types

In [None]:
# We can check the datatype of a variable using the type function; here we check the datatype of a value in a column
# Python automatically assigns a datatype to a variable

# Integer datatype
print(type(df.Quantity[0])) # int64: 64 is the size in bits

# Float --> Used to store decimal values
print(type(df.UnitPrice[0]))

# String
print(type(df.Description[0]))

# Boolean --> True(1) or False(0)
print(type(df.InvoiceNo[0] == 536365))
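Besides checking single values with `type`, a DataFrame reports the datatype of every column through its `dtypes` attribute, and a column can be converted with `astype`. A short sketch on a made-up `sample` table:

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "Quantity": [6, 8],
    "UnitPrice": [2.55, 3.39],
    "Description": ["MUG", "LANTERN"],
})

# One datatype per column
print(sample.dtypes)

# Convert the integer column to float
sample["Quantity"] = sample["Quantity"].astype(float)
print(sample["Quantity"].dtype)
```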

### Data Structures

#### Commonly used data structures are Lists, Tuples, Sets & Dictionaries
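Before going through each structure with the dataset, this side-by-side sketch on a small made-up list of country names shows how the four differ.

```python
# Hypothetical values, for illustration only
countries = ["France", "Germany", "France", "Spain"]   # list: ordered, mutable, allows duplicates
countries.append("EIRE")

fixed = tuple(countries)    # tuple: ordered, immutable
unique = set(countries)     # set: unordered, no duplicates

# dict: maps each unique country (key) to how often it occurs (value)
counts = {name: countries.count(name) for name in unique}

try:
    fixed[0] = "Italy"      # tuples do not support item assignment
except TypeError as err:
    print("Tuple is immutable:", err)

print(len(countries), len(unique), counts["France"])
```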

In [None]:
# Lists are denoted by [] & can contain the same or different data types within them-
# Lists are mutable --> the values can be modified
# We want to list out all the countries available in our dataset

# List of all countries
countries_list = list(df.Country.unique())
print(countries_list)

In [None]:
# List operations

# Remove an element from a list, we want to exclude a country from our analysis

countries_list.remove('United Kingdom') # It updates the existing list countries_list
print("After removing UK: \n", countries_list, "\n") # \n is used for a new line & \t for a tab; these are called escape sequences
print(countries_list.remove('France')) # If you try to print the object returned after operation it will return None
print("After removing France: \n", countries_list, "\n")


# Append an element to the list
# Creating a list
removed_countries = ['United Kingdom', 'France']
countries_list.append(removed_countries) # append() adds the whole list as a single nested element
countries_list = countries_list + removed_countries # + concatenates, adding each country as its own element
print("After adding removed countries: \n", countries_list,  "\n")
countries_list.remove(removed_countries) # Remove the nested list again so that sort() compares only strings

# Sorting a list
countries_list.sort(reverse=False) # reverse True-->Descending
print("After sorting: \n", countries_list)

In [None]:
# Tuples
# A tuple consists of a number of values separated by commas & can be defined with or without parentheses
# They are immutable --> the values cannot be modified

countries_tuple = tuple(df.Country.unique())
print(countries_tuple) # Notice round bracket

countries_tuple[0:5] # Indexing in Python starts from 0 and the end of a slice is excluded, so [0:5] returns positions 0,1,2,3,4

In [None]:
# Sets
# Sets are an unordered collection with no duplicate elements

country_list = list(df.Country)
print(len(country_list), "\n")

# Convert to set --> It will only keep unique entries
country_set = set(country_list)
print(len(country_set), "\n")
print(country_set, "\n") # Notice {} braces

# Basic uses include membership testing and eliminating duplicate entries. 
# Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

# Check if India is present in set
print("India" in country_set, "\n")

countries_set_2 = {'Hong Kong', 'Iceland', 'European Community'}
india = {'India'}

print(country_set - countries_set_2, "\n") # Countries in country_set but not in countries_set_2
print(country_set & countries_set_2, "\n") # Countries in country_set and countries_set_2 (intersection)
print(country_set.union(india), "\n") # Countries in country_set or in india (union)

In [None]:
# Dictionary
# Dictionary is a set of "key: value" pairs, with the requirement that the keys are unique (within one dictionary). 
# A pair of braces creates an empty dictionary: {}

# We want to select the top 5 countries on the basis of their occurrence in the dataset
country_dict = dict(df.Country.value_counts()[:5])
print(country_dict)
print("Key: ", country_dict.keys())
print("Values: ", country_dict.values(), "\n")


print(country_dict['Germany']) # Extract values using the key

# Operations like deleting a key-value pair are also possible using ```del```
del country_dict["France"]
print(country_dict)

# For more operations on dictionaries refer to https://docs.python.org/3/tutorial/datastructures.html#dictionaries
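Accessing a missing key with square brackets raises a `KeyError`. The sketch below shows two common ways around that, `get` with a default value and the `in` membership test, plus `pop` as an alternative to `del`. The `country_col` series is made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only
country_col = pd.Series(["United Kingdom", "Germany", "United Kingdom", "France"])
country_counts = country_col.value_counts().to_dict()

print(country_counts.get("Germany", 0))
print(country_counts.get("India", 0))    # missing key falls back to the default
print("India" in country_counts)

# pop() removes a key and returns its value; the second argument avoids an error if the key is missing
removed = country_counts.pop("France", None)
print(removed, country_counts)
```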

### Environment variables & working with files

In [None]:
# Quite often you will work with libraries or databases that need values read directly from the environment,
# either for configuration (e.g. port numbers) or to keep secrets such as passphrases/keys out of the code

import os

print("$HOME",os.environ["HOME"])

In [None]:
# Writing to a file
with open("test.txt",'w') as f:
    f.write("my first file\n")   # \n denotes newline character
    f.write("This file\n")
    f.write("contains three lines\n")


!ls  # In the notebook commands prefixed with '!' are run by the host os, ```ls``` command lists all files & folders in current directory
# Output from ls shows that the file was created

In [None]:
# Reading a file
with open("test.txt",'r') as f:
    # perform file operations
    print(f.read())


### File modes-
#### There are multiple modes with which a file can be interacted with-

* 'r'	This is the default mode. It opens the file for reading.
* 'w'	This mode opens the file for writing. If the file does not exist, it creates a new file. If the file exists, it truncates the file.
* 'a'	Opens the file in append mode. If the file does not exist, it creates a new file.
* 'b'	This opens in binary mode.
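The modes listed above can be tried out safely in a temporary directory. This is a minimal sketch; the file name `modes_demo.txt` is an arbitrary choice for the example.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")

    with open(path, "w") as f:     # 'w' creates the file, or truncates it if it exists
        f.write("first line\n")

    with open(path, "a") as f:     # 'a' keeps the existing content and adds to the end
        f.write("second line\n")

    with open(path, "r") as f:     # 'r' reads the file as text
        lines = f.read().splitlines()

    with open(path, "rb") as f:    # 'b' combined with 'r' returns raw bytes
        raw = f.read()

print(lines)
print(type(raw).__name__)
```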

### Control flow & logic

#### It is essential to be able to control the behaviour of your code. You may want something to run 100 times given a condition, or to iterate through a list. Control flow with loops & conditional logic helps you do this.

In [None]:
# A 'for loop' repeats a piece of code once for every item of an iterable
# The iterable could be a range of values, a list, or the items of a dictionary

# Loop for 5 times-
for i in range(5):   # range starts from 0 until the number specified, hence in this case- 0,1,2,3,4
    print(i)
print('\n')    
    
# Iterating a list
for country in removed_countries:
    print(country)
print('\n')    

# Enumerating a dictionary
for key,value in country_dict.items():
    print(key, "--", value)

In [None]:
# If Else conditions are standard ways of controlling whether to do something if an expression/variable is true otherwise do something else

# We want to find list of countries having more than 1000 entries
for key, value in dict(df.Country.value_counts()).items():
    
    if value > 1000:
        print(key, "--",value)
        
        
print('\nClassifying countries based on count:')
        
# We want to classify the top 10 countries by their number of entries
for key, value in dict(df.Country.value_counts()[:10]).items():
    
    if value > 10000:
        print(key, "-- More than 10k count")
    elif value >= 2000:
        print(key, "-- Between 2k and 10k")
    else:
        print(key, "-- Less than 2k")

In [None]:
# 'While loop' is used for repeatedly running code as long as an expression/variable is true
i = 0
while i<3:
    print(df.iloc[i, [1,2,3]])
    i += 1

### Break/Continue statement

In [None]:
# Break statement terminates the nearest enclosing loop, skipping the optional else clause if the loop has one.
# Continue statement continues with the next cycle of the nearest enclosing loop.

i = 0
while True:
    temp = df.iloc[i, [1,2,3]]
    print(temp)
    
    if temp.Quantity <= 6:
        print("Quantity less than or equal to 6")
        i += 1
        continue
        
    if temp.Quantity > 6:
        print("Quantity greater than 6")
        i += 1
        break
    

### Error handling

In [None]:
# We are trying to access a few columns from a dataframe, but it throws an error because the column name "quantity" is incorrect; the correct name is "Quantity"
df.loc[:3, ['InvoiceNo', 'StockCode', 'quantity']]

In [None]:
df.columns

In [None]:
# We can use error handling to avoid this and print appropriate message to the user

try:
    print(df.loc[:3, ['InvoiceNo', 'StockCode', 'quantity']])
except Exception as e:   # You can also handle multiple exceptions differently by specifying it like-> except NameError:
    print("The specified column is not available in the dataframe.")
    print("Please use capital letter for 1st character in column name or check if the column exists in the dataframe.")

In [None]:
print(demo_var)
print("This should not print")

In [None]:
# We haven't initialized demo_var with any value, so print(demo_var) raises an error, which the except block catches
try:
  print(demo_var)
except Exception as e:   # You can also handle multiple exceptions differently by specifying it like-> except NameError:
  print("An exception occurred",e)

print("This should print")
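Catching a specific exception type, rather than the general `Exception`, keeps unexpected errors visible. The sketch below also shows the optional `else` and `finally` clauses. The `safe_lookup` function and the `prices` dictionary are hypothetical names used only for illustration.

```python
def safe_lookup(mapping, key):
    """Return mapping[key], or None when the key is missing."""
    try:
        value = mapping[key]
    except KeyError as err:      # only the error we expect is caught
        print("Missing key:", err)
        return None
    else:                        # runs only when no exception was raised
        return value
    finally:                     # runs in both cases
        print("Lookup finished for", key)

prices = {"MUG": 2.55}
found = safe_lookup(prices, "MUG")
missing = safe_lookup(prices, "LANTERN")
print(found, missing)
```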

### Scope of variables

In [None]:
# Every variable has a scope. A function can read a global variable, but to assign to it inside the function you need the global keyword

# This would work since no modification takes place

x = 10
def bar():
    print(x)
bar()

# but this code raises an UnboundLocalError, since assigning to x makes it local to the function
x = 10
def bar():
    print(x)
    x = x + 1
    print(x)
bar()

In [None]:
# To solve the above issue we use `global` keyword which allows us to modify global variables

x = 10
def bar():
    global x
    print(x)
    x = x + 1
    print(x)
bar()
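An alternative to `global` that is often easier to follow is to pass the value in and return the new value. The sketch below also reproduces the scope error in a controlled way. The names `counter`, `increment` and `broken_increment` are made up for this example.

```python
counter = 10

def increment(value):
    """Return a new value instead of modifying a global variable."""
    return value + 1

def broken_increment():
    # Assigning to `counter` makes it local to the function,
    # so reading it before the assignment fails
    counter = counter + 1
    return counter

counter = increment(counter)
print(counter)

try:
    broken_increment()
except UnboundLocalError as err:
    print("UnboundLocalError:", err)
```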

### Working directory & searching for files with Glob module

In [None]:
# Sometimes you might want to change your working directory to another folder in order to access files
# You can do this with `os` module chdir() & find out your current working directory using getcwd()

# Check current working directory.
import os
retval = os.getcwd()

# This is a way of printing text & variables using f-strings, with the `f` prefix & braces/{} around the variable/expression
print(f"Current working directory- {retval}")  

path = "/kaggle/input"

# Now change the directory
os.chdir( path )

# Check current working directory.
retval = os.getcwd()
print(f"Directory changed successfully- {retval}")

# Revert back using old path
path = "/kaggle/working"
os.chdir( path )
retval = os.getcwd()
print(f"Directory changed back successfully- {retval}")

In [None]:
# Finding files - You may want to search for all files that are of a certain extension, you can do that with `glob` module
# We create mock-files using linux's `touch` command
!touch 1.csv 2.csv 3.csv

import glob

list_of_csv = glob.glob('./*.csv')

print(f"{list_of_csv}")
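`glob` can also search sub-folders when the pattern contains `**` and `recursive=True` is passed. The sketch below builds a few mock files in a temporary directory so it does not depend on the notebook's working directory; all file names are arbitrary.

```python
import glob
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create mock files, including one inside a sub-folder
    Path(tmp_dir, "a.csv").touch()
    Path(tmp_dir, "b.txt").touch()
    Path(tmp_dir, "nested").mkdir()
    Path(tmp_dir, "nested", "c.csv").touch()

    # Only the csv files directly inside tmp_dir
    top_level = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

    # '**' with recursive=True also descends into sub-folders
    recursive = sorted(glob.glob(os.path.join(tmp_dir, "**", "*.csv"), recursive=True))

    top_names = [os.path.basename(p) for p in top_level]
    all_names = [os.path.basename(p) for p in recursive]

print(top_names)
print(all_names)
```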

### Installing libraries online & offline

In [None]:
# You can install python modules using pip(recursive acronym of "Pip Installs Packages")
# !pip install pandas

# In certain environments without internet connectivity you can download packages from pypi & install them offline
# Ex. Flask web-server from https://pypi.org/project/Flask/#files

!wget https://files.pythonhosted.org/packages/f2/28/2a03252dfb9ebf377f40fba6a7841b47083260bf8bd8e737b0c6952df83f/Flask-1.1.2-py2.py3-none-any.whl
    
!pip install Flask-1.1.2-py2.py3-none-any.whl
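After installing a package, online or offline, you may want to confirm from Python that it is available. One way is the standard-library `importlib.metadata` module (Python 3.8+). The helper name `installed_version` is hypothetical, and the second package name is deliberately one that should not exist.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

print(installed_version("pip"))
print(installed_version("definitely-not-installed-pkg-xyz-123"))
```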

## Kaggle Platform Additional Info

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session