# Detecting Electrical Plugs in E-Commerce Website Products
#### In this notebook, we explore a dataset from an e-commerce website. In some marketplaces, products with electrical plugs must be accompanied by a compliance form, so it is desirable to identify which products have an electrical plug.

__Dataset schema:__ 
- __ASIN:__ Product ASIN.
- __target_label:__ Binary field with values in {0,1}. A value of 1 indicates that the ASIN has a plug; 0 indicates that it does not.
- __ASIN_STATIC_ITEM_NAME:__ Title of the ASIN.
- __ASIN_STATIC_PRODUCT_DESCRIPTION:__ Description of the ASIN.
- __ASIN_STATIC_GL_PRODUCT_GROUP_TYPE:__ GL information for the ASIN.
- __ASIN_STATIC_ITEM_PACKAGE_WEIGHT:__ Weight of the ASIN.
- __ASIN_STATIC_LIST_PRICE:__ Price information for the ASIN.
- __ASIN_STATIC_BATTERIES_INCLUDED:__ Information whether batteries are included along with the product.
- __ASIN_STATIC_BATTERIES_REQUIRED:__ Information whether batteries are required for using the product.
- __ASIN_STATIC_ITEM_CLASSIFICATION:__ Item classification, for example whether the item is a standalone item or a bundle parent.
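
Before relying on the schema above, it can help to confirm that the loaded dataframe actually contains the listed columns and that `target_label` only takes the values 0 and 1. The following is a minimal sketch on a hypothetical toy frame (`toy` and its rows are made up for illustration; they are not taken from the dataset):

```python
import pandas as pd

# Column names copied from the schema list above
expected_columns = [
    "ASIN",
    "target_label",
    "ASIN_STATIC_ITEM_NAME",
    "ASIN_STATIC_PRODUCT_DESCRIPTION",
    "ASIN_STATIC_GL_PRODUCT_GROUP_TYPE",
    "ASIN_STATIC_ITEM_PACKAGE_WEIGHT",
    "ASIN_STATIC_LIST_PRICE",
    "ASIN_STATIC_BATTERIES_INCLUDED",
    "ASIN_STATIC_BATTERIES_REQUIRED",
    "ASIN_STATIC_ITEM_CLASSIFICATION",
]

# Hypothetical toy frame that only carries three of the schema columns
toy = pd.DataFrame(
    {
        "ASIN": ["A000000001", "A000000002", "A000000003"],
        "target_label": [0, 1, 0],
        "ASIN_STATIC_ITEM_NAME": ["Desk lamp", "Paperback novel", "Toy car"],
    }
)

# Schema columns that are absent from the frame
missing_columns = sorted(set(expected_columns) - set(toy.columns))

# True only if every label is 0 or 1
labels_are_binary = toy["target_label"].isin([0, 1]).all()

print("Missing columns:", missing_columns)
print("Labels are binary:", labels_are_binary)
```

In the notebook, the same two checks could be run against `df` once it has been loaded.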

# Exploratory Data Analysis

## 1. Reading the data
Let's read the dataset into a dataframe.

In [5]:
import pandas as pd

# Load the dataset (the path below is specific to the author's machine)
df = pd.read_csv('C:\\Users\\solharsh\\Downloads\\Tabular Data MLU\\final_project\\final_project\\asin_product.csv', encoding='unicode_escape')
df.head()



|  | Region Id | MarketPlace Id | ASIN | Binding Code | binding_description | brand_code | case_pack_quantity | classification_code | classification_description | color_map | ... | pkg_weight | pkg_weight_uom | pkg_width | release_date_embargo_level | dw_creation_date | dw_last_updated | is_deleted | last_updated | version | external_testing_certification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 153427507 | hardcover | Hardcover | NaN | NaN | base_product | Base Product | NaN | ... | NaN | NaN | NaN | NaN | 4-Jan-11 | 22-Jul-16 | N | 21-Jul-16 | 145.0 | NaN |
| 1 | 1 | 1.0 | 267648340 | hardcover | Hardcover | FOS3T | NaN | base_product | Base Product | NaN | ... | 0.85 | pounds | 5.98 | NaN | 16-Sep-16 | 23-Feb-18 | N | 22-Feb-18 | 33.0 | NaN |
| 2 | 1 | 1.0 | 545496470 | hardcover | Hardcover | KLUTZ | 6.0 | base_product | Base Product | Black | ... | 1.631404 | pounds | 8.267717 | low | 20-Aug-12 | 23-Dec-17 | N | 23-Dec-17 | 20912.0 | NaN |
| 3 | 1 | 1.0 | 679858040 | paperback | Paperback | NaN | NaN | base_product | Base Product | NaN | ... | 1.2 | pounds | 7.7 | NaN | 4-Jan-11 | 4-Nov-17 | N | 3-Nov-17 | 2108.0 | NaN |
| 4 | 1 | 1.0 | 078694742X | toy | Toy | WZDCS | 12.0 | base_product | Base Product | NaN | ... | 0.62 | pounds | 6.535433 | NaN | 4-Jan-11 | 21-Oct-17 | N | 20-Oct-17 | 9395.0 | NaN |
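
Some ASIN values in the output above contain letters (for example `078694742X`), so an identifier column like this is safer to treat as text than as a number. The sketch below shows one way to request that at load time with the `dtype` argument of `pd.read_csv`. The inline CSV is a hypothetical three-row stand-in for the real file, and the leading-zero ASIN in its first row is made up to illustrate the point:

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for asin_product.csv
sample_csv = io.StringIO(
    "Region Id,MarketPlace Id,ASIN,pkg_weight\n"
    "1,1.0,0153427507,\n"
    "1,1.0,078694742X,0.62\n"
    "1,1.0,1059998254,0.05\n"
)

# Reading ASIN as str keeps leading zeros and avoids numeric type inference
sample_df = pd.read_csv(sample_csv, dtype={"ASIN": str})

print(sample_df["ASIN"].tolist())
print(sample_df.dtypes)
```

Without the `dtype` argument, pandas infers a type per column, and an all-digit identifier with a leading zero would lose that zero if it were parsed as an integer.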


## 2. Overall statistics
We will look at the number of rows and columns and at some simple statistics of the dataset.

In [6]:
# Print the first 10 rows
# NaN means missing data
df.head(10)

|  | Region Id | MarketPlace Id | ASIN | Binding Code | binding_description | brand_code | case_pack_quantity | classification_code | classification_description | color_map | ... | pkg_weight | pkg_weight_uom | pkg_width | release_date_embargo_level | dw_creation_date | dw_last_updated | is_deleted | last_updated | version | external_testing_certification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 153427507 | hardcover | Hardcover | NaN | NaN | base_product | Base Product | NaN | ... | NaN | NaN | NaN | NaN | 4-Jan-11 | 22-Jul-16 | N | 21-Jul-16 | 145.0 | NaN |
| 1 | 1 | 1.0 | 267648340 | hardcover | Hardcover | FOS3T | NaN | base_product | Base Product | NaN | ... | 0.85 | pounds | 5.98 | NaN | 16-Sep-16 | 23-Feb-18 | N | 22-Feb-18 | 33.0 | NaN |
| 2 | 1 | 1.0 | 545496470 | hardcover | Hardcover | KLUTZ | 6.0 | base_product | Base Product | Black | ... | 1.631404 | pounds | 8.267717 | low | 20-Aug-12 | 23-Dec-17 | N | 23-Dec-17 | 20912.0 | NaN |
| 3 | 1 | 1.0 | 679858040 | paperback | Paperback | NaN | NaN | base_product | Base Product | NaN | ... | 1.2 | pounds | 7.7 | NaN | 4-Jan-11 | 4-Nov-17 | N | 3-Nov-17 | 2108.0 | NaN |
| 4 | 1 | 1.0 | 078694742X | toy | Toy | WZDCS | 12.0 | base_product | Base Product | NaN | ... | 0.62 | pounds | 6.535433 | NaN | 4-Jan-11 | 21-Oct-17 | N | 20-Oct-17 | 9395.0 | NaN |
| 5 | 1 | 1.0 | 1059998254 | consumer_electronics | Electronics | NaN | NaN | base_product | Base Product | black | ... | 0.05 | pounds | 2.6 | NaN | 13-May-14 | 30-Sep-17 | N | 29-Sep-17 | 117.0 | NaN |
| 6 | 1 | 1.0 | 106171327X | pc | Personal Computers | SDSK9 | 1.0 | base_product | Base Product | NaN | ... | 0.200621 | pounds | 3.0 | NaN | 7-Oct-14 | 14-Mar-18 | N | 14-Mar-18 | 1777.0 | NaN |
| 7 | 1 | 1.0 | 1223027643 | toy | Toy | THIX7 | 1.0 | base_product | Base Product | NaN | ... | 0.449743 | pounds | 8.0 | NaN | 9-Mar-13 | 10-Mar-18 | N | 10-Mar-18 | 1921.0 | NaN |
| 8 | 1 | 1.0 | 1223062562 | toy | Toy | MEXE9 | 20.0 | base_product | Base Product | Multi | ... | 0.661387 | pounds | 11.417323 | none | 22-Mar-13 | 24-Mar-18 | N | 24-Mar-18 | 6884.0 | NaN |
| 9 | 1 | 1.0 | 1223068080 | toy | Toy | MEXE9 | 12.0 | base_product | Base Product | NaN | ... | 0.5 | pounds | 9.8 | NaN | 12-Jul-13 | 10-Mar-18 | N | 9-Mar-18 | 9439.0 | NaN |
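
The row and column counts mentioned at the start of this section can be read directly from the `shape` attribute of a dataframe. A minimal sketch on a hypothetical toy frame (in the notebook, `df.shape` would be used instead):

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame(
    {
        "ASIN": ["A000000001", "A000000002", "A000000003"],
        "pkg_weight": [0.85, None, 0.62],
    }
)

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```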


In [7]:
# This will print basic statistics for numerical columns
df.describe()

|  | MarketPlace Id | case_pack_quantity | discontinued_date | ean | excluded_direct_browse_node_id | fedas_id | fma_qualified_price_max | Product Group Code | item_classification_id | item_display_diameter | ... | recall_notice_receive_date | unit_count | upc | variation_theme_id | video_game_region | pkg_height | pkg_length | pkg_weight | pkg_width | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65715.0 | 20849.0 | 0.0 | 54270.0 | 12080.0 | 13.0 | 51175.0 | 65715.0 | 65714.0 | 368.0 | ... | 0.0 | 8428.0 | 48257.0 | 25485.0 | 20.0 | 52388.0 | 52388.0 | 51921.0 | 52388.0 | 65715.0 |
| mean | 1.0 | 7.964075 | NaN | 1605526000000.0 | 1584601000.0 | 110682.230769 | 94.768035 | 170.466697 | 1.195438 | 17.598872 | ... | NaN | 37.247203 | 617396200000.0 | 35.831116 | 1.6 | 3.470359 | 12.572308 | 6.23624 | 7.714382 | 11258.312501 |
| std | 0.0 | 43.698409 | NaN | 2294313000000.0 | 3106213000.0 | 23918.730618 | 233.215625 | 117.378078 | 1.04951 | 38.3671 | ... | NaN | 565.518245 | 276886800000.0 | 88.842 | 0.940325 | 3.775752 | 11.614111 | 114.104034 | 6.047768 | 75050.40753 |
| min | 1.0 | 0.0 | NaN | 5012.0 | 0.0 | 100640.0 | 0.01 | 14.0 | 1.0 | 0.0 | ... | NaN | -9.0 | 5012.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 1.0 | 1.0 | NaN | 638000000000.0 | 1065616.0 | 100954.0 | 23.41 | 79.0 | 1.0 | 2.1 | ... | NaN | 1.0 | 612000000000.0 | 2.0 | 1.0 | 1.1 | 6.4 | 0.25 | 3.858268 | 53.0 |
| 50% | 1.0 | 1.0 | NaN | 745000000000.0 | 165793000.0 | 100956.0 | 36.96 | 147.0 | 1.0 | 6.0 | ... | NaN | 1.0 | 715000000000.0 | 8.0 | 1.0 | 2.4 | 9.212598 | 0.75 | 6.0 | 191.0 |
| 75% | 1.0 | 6.0 | NaN | 853000000000.0 | 2255565000.0 | 100959.0 | 76.74 | 201.0 | 1.0 | 16.35 | ... | NaN | 12.0 | 794000000000.0 | 24.0 | 3.0 | 4.3 | 14.2 | 2.799871 | 9.9 | 1121.5 |
| max | 1.0 | 2880.0 | NaN | 10000000000000.0 | 17548240000.0 | 167990.0 | 9099.68 | 594.0 | 15.0 | 620.0 | ... | NaN | 50000.0 | 1000000000000.0 | 1439.0 | 3.0 | 74.0 | 268.110236 | 25353.16013 | 109.0 | 660051.0 |
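
By default, `describe()` summarizes only numerical columns, so text columns such as `binding_description` do not appear in the output above. One way to summarize the non-numerical columns is to pass `exclude=[np.number]`, which reports the count, the number of unique values, the most frequent value, and its frequency. A minimal sketch on a hypothetical toy frame (the rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and one numerical column
toy = pd.DataFrame(
    {
        "binding_description": ["Hardcover", "Hardcover", "Toy", None],
        "pkg_weight": [0.85, 1.63, 0.62, 0.05],
    }
)

# Excluding numerical dtypes leaves only the text column in the summary
summary = toy.describe(exclude=[np.number])
print(summary)
```

In the notebook, the equivalent call would be `df.describe(exclude=[np.number])`.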


In [8]:
# Let's see the data types and non-null values for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65773 entries, 0 to 65772
Columns: 112 entries, Region Id to external_testing_certification
dtypes: float64(46), object(66)
memory usage: 56.2+ MB
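
With 112 columns, `info()` prints only a summary and does not list non-null counts per column. To see how much data is missing in each column, the fraction of missing values can be computed with `isna().mean()`. A minimal sketch on a hypothetical toy frame (column names reused from the dataset, values made up):

```python
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame(
    {
        "fedas_id": [100640.0, None, None, None],
        "color_map": ["Black", "Multi", None, "black"],
        "is_deleted": ["N", "N", "N", "N"],
    }
)

# Fraction of missing values per column, most incomplete columns first
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Applied to `df`, this would rank all columns by how sparsely they are populated, which is useful when deciding which features to keep.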
