In [1]:
import pandas as pd

The dataset is from Kaggle:
https://www.kaggle.com/carrie1/ecommerce-data. You don't have to download it, though, as it is already in the repo folder.

In [2]:
data = pd.read_csv('ecommerce_data.csv', index_col=0, parse_dates=True)

In [3]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/01/2010 08:26,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/01/2010 08:26,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/01/2010 08:26,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/01/2010 08:26,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/01/2010 08:26,3.39,17850,United Kingdom


## Making a local, server-less SQL database
[SQLite](https://www.sqlite.org/whentouse.html) is an easy-to-use SQL database system that lets an entire database live in a single file on your computer, with no server needed.

Here, we will make our own SQLite database in the repo folder and populate it with the same e-commerce data we just read into a pandas DataFrame.
This way, we will be able to compare SQL commands (run against an actual SQL database) with their Python equivalents on the same data.

In [4]:
import os
import sqlite3

In [5]:
# Seeing where our SQL database will be created.
os.getcwd()

'/Users/rachelberryman/Documents/FromSQLtoPython'

In [6]:
# If a database with the same name already exists, delete it so that we start from scratch.
if os.path.isfile("E_commerce_data.db"):
    os.remove("E_commerce_data.db")

# Here, we create our new database, called E_commerce_data.db, and establish a connection to it.
# We name this connection "conn". We need to pass that connection to every SQL query we run,
# so that the queries are run on the correct database.
conn = sqlite3.connect("E_commerce_data.db")

In [7]:
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format="%m/%d/%Y %H:%M")

In [8]:
# Writing our pandas DataFrame to a table called 'data' in the SQLite database.
data.to_sql('data', conn, if_exists='replace', index=False)
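
One way to check that a write like this succeeded is to read the row count back from the database and compare it with the DataFrame. The sketch below uses a small made-up DataFrame and an in-memory database (`":memory:"`) so that it runs without the CSV; the names `toy`, `toy_conn`, and `toy_data` are assumptions for illustration and are not part of the notebook.

```python
import sqlite3

import pandas as pd

# Made-up rows standing in for the e-commerce data.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "Quantity": [6, 8, 2],
    "Country": ["United Kingdom", "United Kingdom", "France"],
})

# ":memory:" creates a temporary database that lives only in RAM.
toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_data", toy_conn, if_exists="replace", index=False)

# Read the row count back and compare it with the DataFrame.
row_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM toy_data", toy_conn
)["n"].iloc[0]
print(row_count == len(toy))

toy_conn.close()
```

The same check should work against `conn` and the `data` table, at the cost of one extra query.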

Taking an initial look at our data. 

In [9]:
data.groupby('Country').count()

Country,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID
Australia,1259,1259,1259,1259,1259,1259,1259
Austria,401,401,401,401,401,401,401
Bahrain,17,17,17,17,17,17,17
Belgium,2069,2069,2069,2069,2069,2069,2069
Brazil,32,32,32,32,32,32,32
Canada,151,151,151,151,151,151,151
Channel Islands,758,758,758,758,758,758,758
Cyprus,622,622,622,622,622,622,622
Czech Republic,30,30,30,30,30,30,30
Denmark,389,389,389,389,389,389,389
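
Note that `count()` reports the number of non-missing values in each column, which is why it returns one number per column. If only the number of rows per country is wanted, `size()` is a possible alternative. The sketch below, on made-up data with one deliberately missing description, compares both with what should be their SQL counterparts, `COUNT(column)` and `COUNT(*)`. The names `toy`, `toy_conn`, and `toy_data` are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up rows; one Description is missing on purpose.
toy = pd.DataFrame({
    "Country": ["France", "France", "Spain"],
    "Description": ["LANTERN", None, "HOLDER"],
})
toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_data", toy_conn, index=False)

# count() skips missing values in the column; size() counts every row.
non_null = toy.groupby("Country")["Description"].count()
all_rows = toy.groupby("Country").size()

# In SQL, COUNT(column) skips NULLs and COUNT(*) counts every row.
sql_counts = pd.read_sql_query("""
SELECT Country, COUNT(Description) AS non_null, COUNT(*) AS all_rows
FROM toy_data
GROUP BY Country
ORDER BY Country
""", toy_conn)
print(sql_counts)

toy_conn.close()
```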


We have ~400,000 records.

In [10]:
data.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,406829.0,406829.0,406829.0
mean,12.061303,3.460471,15287.69057
std,248.69337,69.315162,1713.600303
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13953.0
50%,5.0,1.95,15152.0
75%,12.0,3.75,16791.0
max,80995.0,38970.0,18287.0


Of these, about 360,000 are from the UK.

In [11]:
data[data['Country']=='United Kingdom'].describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,361878.0,361878.0,361878.0
mean,11.077029,3.256007,15547.871368
std,263.129266,70.654731,1594.40259
min,-80995.0,0.0,12346.0
25%,2.0,1.25,14194.0
50%,4.0,1.95,15514.0
75%,12.0,3.75,16931.0
max,80995.0,38970.0,18287.0


`describe()` calculates summary statistics for all of the numeric columns.
It is also a good way to spot non-numeric columns that are being misclassified.
"CustomerID", which is made up of numbers, was treated as a numeric column when the CSV was read, even though it is really an identifier.
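
One possible fix is to cast the identifier to a string, so that `describe()` leaves it out of the numeric summary. This is a minimal sketch on made-up data, assuming the column holds whole numbers with no missing values; the DataFrame `toy` is hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as the e-commerce data.
toy = pd.DataFrame({
    "Quantity": [6, 8, 2],
    "UnitPrice": [2.55, 3.39, 2.75],
    "CustomerID": [17850, 17850, 13047],
})

# Treat the identifier as a label rather than a number.
toy["CustomerID"] = toy["CustomerID"].astype(str)

# By default, describe() summarises only the numeric columns.
summary = toy.describe()
print(summary.columns.tolist())
```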

## Question 1: How many countries does our company sell to?

In [12]:
len(data['Country'].unique())

# or:

len(set(data['Country']))

37

In [13]:
pd.read_sql_query("""
SELECT COUNT(DISTINCT Country)
FROM data
""", conn)

Unnamed: 0,COUNT(DISTINCT Country)
0,37
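
The two approaches agree here, but they can differ when the column has missing values: `unique()` keeps a missing value as its own entry, while `COUNT(DISTINCT ...)` ignores NULLs. `Series.nunique()` skips missing values by default, so it is arguably the closer pandas equivalent. A small sketch on made-up data (the names `toy`, `toy_conn`, and `toy_data` are hypothetical):

```python
import sqlite3

import pandas as pd

# Made-up data with one missing country.
toy = pd.DataFrame({"Country": ["France", "Spain", "France", None]})
toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_data", toy_conn, index=False)

# unique() keeps the missing value as its own entry.
with_missing = len(toy["Country"].unique())

# nunique() skips missing values by default.
without_missing = toy["Country"].nunique()

# COUNT(DISTINCT ...) ignores NULLs.
sql_distinct = pd.read_sql_query(
    "SELECT COUNT(DISTINCT Country) AS n FROM toy_data", toy_conn
)["n"].iloc[0]
print(with_missing, without_missing, sql_distinct)

toy_conn.close()
```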


## Question 2: What are our top 10 countries in terms of items sold?

In [14]:
data.groupby(['Country']).Quantity.sum().sort_values(ascending=False).head(10)

Country
United Kingdom    4008533
Netherlands        200128
EIRE               136329
Germany            117448
France             109848
Australia           83653
Sweden              35637
Switzerland         29778
Spain               26824
Japan               25218
Name: Quantity, dtype: int64

In [15]:
pd.read_sql_query("""
SELECT Country, SUM(Quantity)
FROM data
GROUP BY Country
ORDER BY SUM(Quantity) desc
LIMIT 10
""", conn)

Unnamed: 0,Country,SUM(Quantity)
0,United Kingdom,4008533
1,Netherlands,200128
2,EIRE,136329
3,Germany,117448
4,France,109848
5,Australia,83653
6,Sweden,35637
7,Switzerland,29778
8,Spain,26824
9,Japan,25218
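
Two optional refinements, sketched on made-up data: in SQL, an alias (`AS total_quantity`, a name chosen here for illustration) gives the aggregated column a readable header and can be reused in `ORDER BY`; in pandas, `nlargest` combines the sort and the `head` call. The names `toy`, `toy_conn`, and `toy_data` are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up rows: France sells 15 items, Japan 7, Spain 5.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "France", "Japan", "Spain"],
    "Quantity": [10, 4, 5, 7, 1],
})
toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_data", toy_conn, index=False)

# nlargest(n) returns the n largest sums, already sorted in descending order.
top_pandas = toy.groupby("Country")["Quantity"].sum().nlargest(2)

# The alias names the output column and is reused in ORDER BY.
top_sql = pd.read_sql_query("""
SELECT Country, SUM(Quantity) AS total_quantity
FROM toy_data
GROUP BY Country
ORDER BY total_quantity DESC
LIMIT 2
""", toy_conn)
print(top_pandas)
print(top_sql)

toy_conn.close()
```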




## Question 3: How much money did we make on sticker sheets?

In [16]:
stickers = data[data['Description'].str.contains("STICKER SHEET", na=False)].reset_index()
stickers['Revenue'] = stickers['Quantity'] * stickers['UnitPrice']
stickers['Revenue'].sum()

1139.0

In [17]:
pd.read_sql_query("""
SELECT SUM(UnitPrice * Quantity)
FROM data
WHERE Description LIKE '%STICKER SHEET%'
""", conn)

Unnamed: 0,SUM(UnitPrice * Quantity)
0,1139.0
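
The two results match on this dataset, but the filters are not strictly equivalent: `str.contains` is case-sensitive by default, whereas SQLite's `LIKE` is case-insensitive for ASCII characters by default. The sketch below, on made-up rows, shows the difference and how `case=False` brings pandas in line with `LIKE`. It also passes the pattern through `params` with a `?` placeholder instead of writing it into the query string. The names `toy`, `toy_conn`, and `toy_data` are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up rows; the second description uses mixed case on purpose.
toy = pd.DataFrame({
    "Description": ["STICKER SHEET ROBOTS", "Sticker Sheet flowers", "WHITE METAL LANTERN"],
    "Quantity": [10, 2, 1],
    "UnitPrice": [0.5, 1.0, 3.0],
})
toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_data", toy_conn, index=False)

# Case-sensitive match (the pandas default): only the upper-case row.
strict = toy[toy["Description"].str.contains("STICKER SHEET", na=False)]
pandas_strict = (strict["Quantity"] * strict["UnitPrice"]).sum()

# Case-insensitive match: both sticker rows.
loose = toy[toy["Description"].str.contains("STICKER SHEET", case=False, na=False)]
pandas_loose = (loose["Quantity"] * loose["UnitPrice"]).sum()

# SQLite's LIKE ignores ASCII case by default, so it matches both sticker rows.
sql_total = pd.read_sql_query(
    "SELECT SUM(UnitPrice * Quantity) AS revenue FROM toy_data WHERE Description LIKE ?",
    toy_conn,
    params=("%STICKER SHEET%",),
)["revenue"].iloc[0]
print(pandas_strict, pandas_loose, sql_total)

toy_conn.close()
```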


## Question 4: How much money did we make in 2011 in the UK?

In [18]:
uk_2011 = data.loc[(data['Country']=='United Kingdom') & (data['InvoiceDate'].dt.year==2011)] 
sum(uk_2011['UnitPrice'] * uk_2011['Quantity'])

6284073.65400224

In [19]:
# SQLite doesn't have a year() function, so we have to get creative. 
# In the WHERE clause of our query, we have to format the date to just include its year component.
# You can read about date formats in SQLite here: https://www.tutorialspoint.com/sqlite/sqlite_date_time.html
pd.read_sql_query("""
SELECT SUM(UnitPrice * Quantity)
FROM data
WHERE Country = 'United Kingdom' AND strftime('%Y', InvoiceDate) = '2011'
""", conn)

Unnamed: 0,SUM(UnitPrice * Quantity)
0,6284074.0
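
The `strftime` filter works because pandas typically writes datetime columns to SQLite as ISO-8601 text (for example `2011-01-04 10:00:00`), which SQLite's date functions can read. Under that assumption, a plain range comparison on the text is another option. The sketch below compares the pandas filter with both SQL variants on made-up rows; the names `toy`, `toy_conn`, and `toy_data` are hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up rows: one invoice in 2010 and two in 2011.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2011-01-04 10:00", "2011-12-09 12:50"]
    ),
    "Quantity": [6, 2, 4],
    "UnitPrice": [2.5, 1.0, 0.5],
})
toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_data", toy_conn, index=False)

# pandas: filter on the year component of the datetime column.
in_2011 = toy[toy["InvoiceDate"].dt.year == 2011]
pandas_total = (in_2011["UnitPrice"] * in_2011["Quantity"]).sum()

# SQL variant 1: extract the year as text with strftime.
sql_strftime = pd.read_sql_query("""
SELECT SUM(UnitPrice * Quantity) AS revenue
FROM toy_data
WHERE strftime('%Y', InvoiceDate) = '2011'
""", toy_conn)["revenue"].iloc[0]

# SQL variant 2: half-open date range, compared as ISO-8601 text.
sql_range = pd.read_sql_query("""
SELECT SUM(UnitPrice * Quantity) AS revenue
FROM toy_data
WHERE InvoiceDate >= '2011-01-01' AND InvoiceDate < '2012-01-01'
""", toy_conn)["revenue"].iloc[0]
print(pandas_total, sql_strftime, sql_range)

toy_conn.close()
```

The range form avoids calling a function on every row, which may matter on larger tables, though that is not measured here.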


EXERCISE: What you can't do with SQL:
- Moving beyond explanatory queries to predictive analytics: a simple ML model
- Visualisations with matplotlib