# Deliverables:

- Submit a single zip-compressed file named YourLastName_Exercise_1 that contains the following files:

 1. Your **PDF document** containing your source code and output
 2. Your **ipynb script** containing your source code and output


# Objectives:

In this exercise, you will:

 - Analyze the dataset in the given CSV file
 - Clean the given dataset
 - Load the dataset into a SQLite database
 - Execute different SQL queries




# Submission Format:

Create a folder or directory containing all supplementary files, with your last name at the beginning of the folder name. Compress that folder with zip compression and post the zip-archived folder under the assignment link in Canvas. The following files should be included in the archive folder/directory that is uploaded as a single zip-compressed file. (Use zip, not StuffIt, 7z, or any other compression method.)


1. Complete IPYNB script that has the Python source code used to access and analyze the data. The code should be submitted as an IPYNB script that can be loaded and run in Jupyter Notebook for Python.
2. Output from the program, such as console listing/logs, text files, and graphics output for visualizations. If you use the Data Science Computing Cluster or School of Professional Studies database servers or systems, include Linux logs of your sessions as plain text files. Linux logs may be generated by using the script process at the beginning of your session, as demonstrated in tutorial handouts for the DSCC servers.
3. List file names and descriptions of files in the zip-compressed folder/directory.


Formatting Python Code
When programming in Python, refer to Kenneth Reitz’ PEP 8: The Style Guide for Python Code:
http://pep8.org/
There is also the Google style guide for Python at
https://google.github.io/styleguide/pyguide.html
Comment often and in detail.


###   Data Preparation

As a data scientist for the retailer BestDeal, you
have been tasked with improving its revenue and the effectiveness of the
marketing campaign for its electronic products. The given dataset has
10,000 records of customer purchases and is used to predict
customers' shopping patterns and to answer ad hoc queries.
The dataset DirtyData4BestDeal10000.csv is drawn from the company's database of
customers.

In [None]:
import pandas as pd  # panda's nickname is pd

import numpy as np  # numpy as np

from pandas import DataFrame, Series     # for convenience

import sqlalchemy

from sqlalchemy import create_engine

from sqlalchemy import inspect

### Let's read the DirtyData4BestDeal10000.csv file and load it into a DataFrame object

In [None]:
dirtydata4bestdeal=pd.read_csv('DirtyData4BestDeal10000.csv')

In [None]:

# Do you see NaN values below?

dirtydata4bestdeal.head()
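
Looking at `head()` only shows the first five rows, so missing values further down are easy to overlook. A minimal sketch of counting them instead, using a hypothetical toy frame (`toy` and its values are assumptions standing in for `dirtydata4bestdeal`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dirtydata4bestdeal
toy = pd.DataFrame({
    "ZipCode": [60616, 60616, np.nan, 30134],
    "CustomerAge": [25, np.nan, 40, 33],
    "XBOX360": [1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```

The same two calls should work on `dirtydata4bestdeal` directly.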

### Let's use boxplots to visualize the data and check for dirty/messy/invalid values

In [None]:
dirtydata4bestdeal.boxplot(column='CustomerAge')


In [None]:
dirtydata4bestdeal.boxplot(column='LenevoLaptop')

In [None]:
dirtydata4bestdeal.boxplot(column='ZipCode')

### Let's clean the dirty/messy data in the dirtydata4bestdeal DataFrame object

In [None]:
# Drop the NaN values 

cleandata4bestdeal=dirtydata4bestdeal.dropna()
cleandata4bestdeal.head()

# Do you see NaN values dropped below?


In [None]:
# objects = ['SamsungTV46LED' , 'SonyTV42LED', 'XBOX360', 'DellLaptop', 'BoseSoundSystem']

# for i in objects:
#     cleandata4bestdeal[i] = pd.to_numeric(cleandata4bestdeal[i], downcast='float', errors='coerce')
    
# cleandata4bestdeal.dtypes
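
The commented-out loop above relies on `errors='coerce'`, which turns any value that cannot be parsed as a number into NaN so that a later `dropna()` can remove it. A small sketch of that behavior on hypothetical toy data (the dirty values `"one"` and `"?"` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: product columns read as text because of dirty entries
toy = pd.DataFrame({
    "XBOX360": ["1", "0", "one", "1"],
    "DellLaptop": ["0", "1", "1", "?"],
})

# Unparseable entries become NaN instead of raising an error
for col in ["XBOX360", "DellLaptop"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Rows that contained a dirty entry can now be dropped
cleaned = toy.dropna()
print(toy.dtypes)
print(cleaned)
```

When applying this to a frame that came from `dropna()`, calling `.copy()` on it first may avoid pandas' chained-assignment warning.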

In [None]:
# Add the rest of your code here to clean the data

# Remove duplicates
# cleandata4bestdeal = cleandata4bestdeal.drop_duplicates()
# cleandata4bestdeal = cleandata4bestdeal.dropna()
# print(cleandata4bestdeal.duplicated().sum())
# print(cleandata4bestdeal.isna().sum())

In [None]:
# Experimenting with how to identify rows that contain outliers.
# The filter below is meant to keep only rows whose z-scores are all below 3 in absolute value (that is, to drop extreme outliers).
# Need to explore in more detail

# from scipy import stats
# cleandata4bestdeal[(np.abs(stats.zscore(cleandata4bestdeal)) < 3).all(axis=1)]
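
One way the z-score idea above might be made to work is to restrict it to numeric columns and compute the scores with pandas alone. This is only a sketch on hypothetical toy data; the threshold of 3 comes from the comment above, and `ddof=0` is chosen to match what `scipy.stats.zscore` does by default:

```python
import pandas as pd

# Hypothetical toy data: 20 plausible ages plus one obviously invalid value
ages = [25, 30, 35, 28, 32, 27, 31, 29, 33, 26] * 2 + [900]
toy = pd.DataFrame({"CustomerAge": ages, "XBOX360": [1, 0] * 10 + [1]})

# z-scores only make sense for numeric columns
numeric = toy.select_dtypes(include="number")
z = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Keep rows where every numeric column is within 3 standard deviations
keep = (z.abs() < 3).all(axis=1)
filtered = toy[keep]
print(len(toy), len(filtered))
```

A column with zero variance yields NaN z-scores, which this filter would treat as outliers, so such columns may need to be excluded first.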

### Let's store the cleaned data in the database

In [None]:
engine=create_engine('sqlite:///bestdeal.db')

In [None]:
cleandata4bestdeal.to_sql('trans4cust', engine)
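
By default `to_sql` raises an error if the table already exists and writes the DataFrame index as an extra column. A hedged sketch of the two optional arguments that change this, shown on an in-memory SQLite connection from the standard library rather than the `bestdeal.db` engine (the toy frame is hypothetical):

```python
import sqlite3
import pandas as pd

# Hypothetical toy frame and an in-memory database, so nothing is written to disk
toy = pd.DataFrame({"ZipCode": [60616, 30134], "XBOX360": [1, 0]})
conn = sqlite3.connect(":memory:")

# if_exists="replace" lets the cell be re-run; index=False skips the extra index column
toy.to_sql("trans4cust", conn, if_exists="replace", index=False)
toy.to_sql("trans4cust", conn, if_exists="replace", index=False)  # second run does not raise

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
columns = list(pd.read_sql_query("SELECT * FROM trans4cust", conn).columns)
print(tables, columns)
```

The same keyword arguments can be passed when writing through the SQLAlchemy engine.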

**Sanity Test: Did it create the table in bestdeal.db? Check!**

In [None]:
insp=inspect(engine)

In [None]:
insp.get_table_names()

In [None]:
pd.read_sql_table('trans4cust', engine).columns

### Now we are ready to query the Database

#### Query example #1: get the transactions for the customers in ZipCode 60616

In [None]:
resultsForBestDealCustTrans=pd.read_sql_query("SELECT * FROM trans4cust WHERE ZipCode='60616'", engine)

In [None]:
resultsForBestDealCustTrans.head()
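
When the zip code comes from a variable, it can be passed as a query parameter instead of being pasted into the SQL string. A minimal sketch, assuming a plain `sqlite3` connection (where the placeholder is `?`) and a hypothetical toy table; with a SQLAlchemy engine the placeholder syntax differs:

```python
import sqlite3
import pandas as pd

# Hypothetical toy table in an in-memory database
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({"ZipCode": [60616, 60616, 30134],
                    "CustomerAge": [25, 40, 33]})
toy.to_sql("trans4cust", conn, index=False)

# The value is bound to the ? placeholder by the database driver
zip_code = 60616
result = pd.read_sql_query(
    "SELECT * FROM trans4cust WHERE ZipCode = ?", conn, params=(zip_code,))
print(result)
```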

#### Query example #2: get the transactions for ALL customers

In [None]:
resultsForBestDealCustTrans=pd.read_sql_query("SELECT * FROM trans4cust", engine)

In [None]:
resultsForBestDealCustTrans.head()

#### Query example #3: get the number of customers in every ZipCode sorted by ZipCode

In [None]:
resultsForBestDealCustTrans=pd.read_sql_query("SELECT ZipCode , COUNT(*) as 'num_customers' FROM trans4cust GROUP BY ZipCode  ORDER BY ZipCode", engine)

In [None]:
resultsForBestDealCustTrans
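
A quick way to sanity-check a `GROUP BY` query is to compute the same counts in pandas and compare. The sketch below does this on a hypothetical toy table held in an in-memory SQLite database:

```python
import sqlite3
import pandas as pd

# Hypothetical toy table in an in-memory database
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({"ZipCode": [60616, 30134, 60616]})
toy.to_sql("trans4cust", conn, index=False)

# Counts per zip code computed by SQL
counts_sql = pd.read_sql_query(
    "SELECT ZipCode, COUNT(*) AS num_customers "
    "FROM trans4cust GROUP BY ZipCode ORDER BY ZipCode", conn)

# The same counts computed by pandas (groupby sorts the keys by default)
counts_pd = toy.groupby("ZipCode").size().reset_index(name="num_customers")

print(counts_sql)
print(counts_pd)
```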

#### Query example #4: get the number of customers for every age group in ZipCode 60616 sorted by CustomerAge

In [None]:
resultsForBestDealCustTrans=pd.read_sql_query(
"SELECT CustomerAge , COUNT(*) as 'num_customers' FROM trans4cust WHERE ZipCode=60616 GROUP BY CustomerAge  ORDER BY CustomerAge", engine)

In [None]:
resultsForBestDealCustTrans

#### Query example #5: Plot in a stacked-bar figure the number of customers who bought SonyTV60LED and/or BoseSoundSystem in every ZipCode that has more than 400 customers who bought these products (either one of them or both)

In [None]:
SonyTV60LEDCustTrans=pd.read_sql_query(
"SELECT ZipCode , COUNT(*) as 'num_customers' FROM trans4cust WHERE SonyTV60LED=1  GROUP BY ZipCode HAVING COUNT(*) > 400", engine)

BoseSoundSystemCustTrans=pd.read_sql_query(
"SELECT ZipCode , COUNT(*) as 'num_customers' FROM trans4cust WHERE BoseSoundSystem=1 GROUP BY ZipCode HAVING COUNT(*) > 400", engine)

In [None]:
SonyTV60LEDCustTrans

In [None]:
BoseSoundSystemCustTrans

In [None]:
SonyTV60LEDCustTrans.ZipCode


In [None]:
#   There are zip codes where Sony was bought but not Bose,
#   and zip codes where Bose was bought but not Sony.
#
#   A stacked-bar graph needs both series on the same x values, and the two sets of zip codes may differ.
#   So we need to do some work to build a matching set of zip code values for Sony and Bose.


# numpy.int was removed in NumPy 1.24; the built-in int does the same job here
sonyZipCodeTuples=tuple(SonyTV60LEDCustTrans.ZipCode.astype(int))
sony_num_customersTuples=tuple(SonyTV60LEDCustTrans.num_customers.astype(int))

boseZipCodeTuples=tuple(BoseSoundSystemCustTrans.ZipCode.astype(int))
bose_num_customersTuples=tuple(BoseSoundSystemCustTrans.num_customers.astype(int))




sony_dict = dict(zip(sonyZipCodeTuples, sony_num_customersTuples))
bose_dict = dict(zip(boseZipCodeTuples, bose_num_customersTuples))

# Give each dictionary a zero entry for any zip code that appears only in the other one
for key in bose_dict:
    if key not in sony_dict:
        sony_dict[key] = 0

for key in sony_dict:
    if key not in bose_dict:
        bose_dict[key] = 0


        
bose_zip= sorted(bose_dict.keys())

sony_zip= sorted(sony_dict.keys())

bose_zip_tuple=tuple(bose_zip)

sony_zip_tuple=tuple(sony_zip)

bose_customer_list=[]

for bose in bose_zip_tuple:
    bose_customer_list.append(bose_dict[bose])

sony_customer_list=[]

for sony in sony_zip_tuple:
    sony_customer_list.append(sony_dict[sony])

bose_customer_tuple=tuple(bose_customer_list)
sony_customer_tuple=tuple(sony_customer_list)
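
The dictionary bookkeeping above could also be expressed as an outer merge in pandas, which lines up the two result sets on `ZipCode` and leaves gaps that can be filled with zero. This is an alternative sketch, not the notebook's method, and the two small frames are hypothetical stand-ins for `SonyTV60LEDCustTrans` and `BoseSoundSystemCustTrans`:

```python
import pandas as pd

# Hypothetical query results with only partly overlapping zip codes
sony = pd.DataFrame({"ZipCode": [30134, 60616], "num_customers": [450, 500]})
bose = pd.DataFrame({"ZipCode": [60616, 90033], "num_customers": [420, 610]})

# An outer merge keeps every zip code from either side; missing counts become 0
aligned = (
    sony.merge(bose, on="ZipCode", how="outer", suffixes=("_sony", "_bose"))
    .fillna(0)
    .sort_values("ZipCode")
    .reset_index(drop=True)
)
count_cols = ["num_customers_sony", "num_customers_bose"]
aligned[count_cols] = aligned[count_cols].astype(int)
print(aligned)
```

The two count columns can then be passed to `plt.bar` in place of the tuples.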


In [None]:
# See docs for bar_stack at the URL
# http://matplotlib.org/examples/pylab_examples/bar_stacked.html

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline 


ind = np.arange(len(sony_customer_tuple))    


# the width of the bars: can also be len(x) sequence
width = .5


p1 = plt.bar(ind, sony_customer_tuple, width,  color='r')
p2 = plt.bar(ind, bose_customer_tuple, width, color='y', bottom=sony_customer_tuple)


plt.ylabel('Number of Customers')
plt.xlabel('Zip Code')

plt.title('Number of Customers by ZipCode and 2 Products')

# Bars are centered on ind by default in Matplotlib 2.0 and later, so the ticks go at ind
plt.xticks(ind, sony_zip_tuple)

plt.yticks(np.arange(0, 2000, 100))
plt.legend((p1[0], p2[0]), ('Sony', 'Bose'))

plt.show()

# Requirements:
1. (Use SQL/SQLite): Get the number of customers who bought DellLaptop and HPPrinter for every age group sorted by CustomerAge
2. (Use SQL/SQLite): Get the list of ZipCodes where no customer bought XBOX360 (this query means NOT even a single customer in that zip code bought XBOX360)
3. (Use SQL/SQLite/Matplotlib): Plot in a stacked-bar figure the number of customers who bought HPLaptop and/or HPPrinter but did NOT buy WDexternalHD for every CustomerAge group that has more than 100 customers who bought these two products (either one of them or both, but not WDexternalHD)


In [None]:
# Write your python code that meets the above requirements in this cell
pd.read_sql_table('trans4cust', engine).columns

In [None]:
# 1. (Use SQL/SQLite): Get the number of customers who bought DellLaptop and HPPrinter for every age group sorted by CustomerAge
dellLaptop_and_HPPrinter_trans=pd.read_sql_query(
''' SELECT CustomerAge 
    , COUNT(*) AS num_customers 
    FROM trans4cust 
    WHERE DellLaptop = 1
    AND HPPrinter = 1
    GROUP BY CustomerAge 
    ORDER BY CustomerAge ''', engine)

dellLaptop_and_HPPrinter_trans


In [None]:
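# 2. (Use SQL/SQLite): Get the list of ZipCodes where no customer bought XBOX360 (this query means NOT even a single customer in that zip code bought XBOX360)

# Filtering with WHERE XBOX360=0 would keep any ZipCode that has at least one non-buyer.
# Grouping first and requiring zero buyers per group matches the requirement
# (assumes XBOX360 is coded 1 for a purchase, as in the query examples above).
zipCodes_noXbox360_trans=pd.read_sql_query(
''' SELECT ZipCode 
    FROM trans4cust 
    GROUP BY ZipCode
    HAVING SUM(CASE WHEN XBOX360 = 1 THEN 1 ELSE 0 END) = 0
    ORDER BY ZipCode ''', engine)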

zipCodes_noXbox360_trans
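
For requirement 2, it helps to see how a row filter and a group filter differ. `WHERE XBOX360 = 0` keeps every zip code that has at least one non-buyer, while a `HAVING` condition on the whole group keeps only zip codes with no buyers at all. A small sketch on a hypothetical four-row table, assuming purchases are coded as 1 and non-purchases as 0:

```python
import sqlite3

# Hypothetical toy table: 60616 has one buyer, 30134 has none
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans4cust (ZipCode INTEGER, XBOX360 INTEGER)")
conn.executemany(
    "INSERT INTO trans4cust VALUES (?, ?)",
    [(60616, 1), (60616, 0), (30134, 0), (30134, 0)],
)

# Row filter: any zip code with at least one non-buyer
where_only = [r[0] for r in conn.execute(
    "SELECT ZipCode FROM trans4cust WHERE XBOX360 = 0 "
    "GROUP BY ZipCode ORDER BY ZipCode")]

# Group filter: only zip codes where nobody bought the product
having = [r[0] for r in conn.execute(
    "SELECT ZipCode FROM trans4cust GROUP BY ZipCode "
    "HAVING SUM(XBOX360) = 0 ORDER BY ZipCode")]

print(where_only)  # [30134, 60616]
print(having)      # [30134]
```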


In [None]:
# 3. (Use SQL/SQLite/Matplotlib): Plot in a stacked-bar figure the number of customers 
# who bought HPLaptop and/or HPPrinter but did NOT buy WDexternalHD for every CustomerAge group 
# that has more than 100 customers who bought these two products
# (either bought one of these products or the two products but didn't buy WDexternalHD)
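
One possible shape for requirement 3 is sketched below. It follows the pattern of Query example #5 (one query per product, each with its own `HAVING` threshold), which is only one reading of the requirement. The toy rows, the helper names `count_by_age` and `plot_stacked`, and the lowered threshold are all assumptions for illustration. Product columns are assumed to be coded 0/1.

```python
import sqlite3
import pandas as pd

# Hypothetical toy rows: (CustomerAge, HPLaptop, HPPrinter, WDexternalHD)
rows = [
    (25, 1, 0, 0), (25, 1, 1, 0), (25, 0, 1, 0),
    (30, 1, 0, 0), (30, 1, 0, 1), (30, 0, 1, 0), (30, 0, 1, 0),
    (40, 1, 1, 1),
]
cols = ["CustomerAge", "HPLaptop", "HPPrinter", "WDexternalHD"]
conn = sqlite3.connect(":memory:")
pd.DataFrame(rows, columns=cols).to_sql("trans4cust", conn, index=False)

MIN_CUSTOMERS = 1  # the assignment asks for 100; lowered to suit the toy data


def count_by_age(product):
    """Count buyers of `product` who did not buy WDexternalHD, per age group."""
    # `product` is interpolated into the SQL, so pass only trusted column names
    sql = f"""SELECT CustomerAge, COUNT(*) AS num_customers
              FROM trans4cust
              WHERE {product} = 1 AND WDexternalHD = 0
              GROUP BY CustomerAge
              HAVING COUNT(*) > ?
              ORDER BY CustomerAge"""
    return pd.read_sql_query(sql, conn, params=(MIN_CUSTOMERS,))


laptop = count_by_age("HPLaptop")
printer = count_by_age("HPPrinter")

# Line the two results up on CustomerAge; a missing count becomes 0
aligned = (
    laptop.merge(printer, on="CustomerAge", how="outer",
                 suffixes=("_laptop", "_printer"))
    .fillna(0)
    .sort_values("CustomerAge")
    .reset_index(drop=True)
)


def plot_stacked(df):
    """Draw the stacked bars; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    labels = df["CustomerAge"].astype(str)
    ax.bar(labels, df["num_customers_laptop"], label="HPLaptop")
    ax.bar(labels, df["num_customers_printer"],
           bottom=df["num_customers_laptop"], label="HPPrinter")
    ax.set_xlabel("Customer Age")
    ax.set_ylabel("Number of Customers")
    ax.legend()
    return ax


print(aligned)
```

Calling `plot_stacked(aligned)` in the notebook should draw the figure. Against the real table, `MIN_CUSTOMERS` would be set to 100 and `conn` replaced by the notebook's connection.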

