# Assignment 1

## Title: Data Wrangling, I

### Problem Statement
Perform the following operations using Python on any open source dataset (e.g., data.csv):
1. Import all the required Python Libraries.
2. Locate an open source dataset on the web (e.g. https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the describe()
function to get some initial statistics. Provide variable descriptions, types of variables,
etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you do in the above steps and
explain everything that you do to import/read/scrape the data set.

## 1. Import all the required Python Libraries.

In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 2. Locate an open source dataset on the web (e.g. https://www.kaggle.com). Provide a clear description of the data and its source (i.e., URL of the web site).

### Grocery dataset
Market basket analysis, product recommendations and store optimization.

Link: https://www.kaggle.com/datasets/tanishqdublish/grocery-dataset

#### About Dataset
Scraped grocery data from Costco's online marketplace.
##### **Features:** 
**Sub Category:** This column categorizes the grocery items into subcategories, providing a detailed classification for easier analysis and organization.<br>
**Price:** Represents the monetary value of the grocery item, indicating its cost or retail price in the specified currency.<br>
**Discount:** Reflects any discounts or promotional offers applicable to the respective grocery item, providing insights into pricing strategies.<br>
**Rating:** Indicates customer satisfaction or product quality based on user ratings, offering a measure of the overall perceived value of the grocery item.<br>
**Title:** Describes the name or title of the grocery item, providing a concise identifier for easy reference and understanding.

#### Collaborators
- Tanishq dublish (Owner)

#### Provenance
##### SOURCES
- From the internet.
##### COLLECTION METHODOLOGY
- Author collected this dataset from the internet.

#### License
- Apache 2.0

## 3. Load the Dataset into pandas data frame.

In [75]:
data = pd.read_csv("GroceryDataset.csv")

Preview the first and last rows of the data, then inspect the column summary.

In [76]:
data.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake\nCertified Kosher OU-...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count\nIndividually wrapped\nMade in and I...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways\n32 oz 2-Pack\nNo Prese...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake\nCertified Kosh...",A cake the dessert epicure will die for!To the...


In [77]:
data.tail(5)

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
1752,Snacks,$23.99,No Discount,,"Oberto Thin Style Smoked Sausage Stick, Cockta...",$,Cocktail Pepperoni Smoked Sausage Sticks 3...,Cocktail PepperoniSmoked Sausage Sticks3 oz ba...
1753,Snacks,$49.99,No Discount,,"Cheetos Crunchy, Original, 2.1 oz, 64-count",$,Made with Real Cheese,64-count2.1 oz Bags
1754,Snacks,$22.99,No Discount,,"Sabritas Chile & Limon Mix, Variety Pack, 30-c...",$,Chile & Limón Mix Variety Pack 30 ct Net...,8-Doritos Dinamita Chile Limón Flavored Rolled...
1755,Snacks,$17.49,No Discount,,"Fruit Roll-Ups, Variety Pack, 72-count",$,Variety Pack 1 Box with 72 Rolls Flavore...,Fruit Flavored Snacks\nVariety Includes: Straw...
1756,Snacks,$21.99,No Discount,,"Takis, Rolled Tortilla Chips, Intense Nacho, 1...",$,"Intense Nacho Cheese Non-Spicy 1 oz bag, 5...",Takis Non-Spicy Cheese Tortilla Chips\nIndivid...


In [78]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1757 entries, 0 to 1756
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Sub Category         1757 non-null   object
 1   Price                1754 non-null   object
 2   Discount             1757 non-null   object
 3   Rating               682 non-null    object
 4   Title                1757 non-null   object
 5   Currency             1752 non-null   object
 6   Feature              1739 non-null   object
 7   Product Description  1715 non-null   object
dtypes: object(8)
memory usage: 109.9+ KB


## 4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.

In [79]:
data.describe()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
count,1757,1754,1757,682,1757,1752,1739,1715
unique,19,184,42,483,1484,1,1401,1435
top,Snacks,$14.99,No Discount,No Reviews,"Ziploc Seal Top Freezer Bag, Gallon, 38-count,...",$,Pick Your Arrival Date at Checkout Gift Mess...,Item may be available in your local warehouse ...
freq,293,81,1626,61,4,1752,7,5


In [80]:
data.isnull()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
1752,False,False,False,True,False,False,False,False
1753,False,False,False,True,False,False,False,False
1754,False,False,False,True,False,False,False,False
1755,False,False,False,True,False,False,False,False


In [81]:
data.isnull().sum()

Sub Category              0
Price                     3
Discount                  0
Rating                 1075
Title                     0
Currency                  5
Feature                  18
Product Description      42
dtype: int64
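
The raw counts above are easier to judge as a share of all rows: Rating, for example, is missing in 1075 of 1757 rows, which is roughly 61%. A minimal sketch of such a summary follows. The helper name `missing_summary` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Tiny made-up frame standing in for the grocery data
toy = pd.DataFrame({
    "Price": ["$1.99", None, "$3.49", "$5.00"],
    "Rating": [None, None, "Rated 4 out of 5 stars based on 2 reviews.", None],
    "Title": ["a", "b", "c", "d"],
})
print(missing_summary(toy))
```

Calling `missing_summary(data)` on the real frame should give the same counts as `data.isnull().sum()`, with a percentage column next to them.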

In [82]:
data.notnull()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...
1752,True,True,True,False,True,True,True,True
1753,True,True,True,False,True,True,True,True
1754,True,True,True,False,True,True,True,True
1755,True,True,True,False,True,True,True,True


In [83]:
data.notnull().sum()

Sub Category           1757
Price                  1754
Discount               1757
Rating                  682
Title                  1757
Currency               1752
Feature                1739
Product Description    1715
dtype: int64

In [84]:
data.shape

(1757, 8)

In [85]:
data.size

14056

## 5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.

In [86]:
data.dtypes

Sub Category           object
Price                  object
Discount               object
Rating                 object
Title                  object
Currency               object
Feature                object
Product Description    object
dtype: object

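Planned conversions:
- Sub Category: object to string
- Price: object to float (after stripping the currency symbol and commas)
- Discount: left unchanged
- Rating: object to string
- Title: object to string
- Currency: left unchanged
- Feature: object to string
- Product Description: object to string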

In [87]:
data['Sub Category'] = data['Sub Category'].astype('string') 
data['Rating'] = data['Rating'].astype('string') 
data['Title'] = data['Title'].astype('string') 
data['Feature'] = data['Feature'].astype('string') 
data['Product Description'] = data['Product Description'].astype('string') 

In [88]:
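# Strip the currency symbol; regex=False treats "$" as a literal character rather than a regex anchor
data['Price'] = data['Price'].str.replace('$', '', regex=False)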

In [89]:
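# No "$" is left after the previous cell, so the "$through-" pattern matches nothing at this point
data['Price'] = data['Price'].str.replace("$through-", "", regex=False)
# Remove commas so that the remaining text can be parsed as a number
data['Price'] = data['Price'].str.replace(",", "", regex=False)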

In [90]:
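# Keep the first five characters and cast to float; longer prices are truncated (159.99 becomes 159.9)
data['Price'] = data['Price'].str[:5].astype('float')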
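
Slicing to five characters works for prices such as 56.99 but shortens 159.99 to 159.9. One possible alternative, sketched below, is to pull the first numeric amount out of each string with a regular expression. The helper name `parse_price` and the sample strings are assumptions for illustration; in particular, the range-style string is only a guess at what the raw data may contain.

```python
import pandas as pd

def parse_price(prices: pd.Series) -> pd.Series:
    """Extract the first numeric amount from each price string as a float."""
    # Drop commas first so that "1,299.99" is read as one number
    cleaned = prices.str.replace(",", "", regex=False)
    # Capture the first run of digits with an optional decimal part
    first_number = cleaned.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    # Anything that did not match becomes NaN
    return pd.to_numeric(first_number, errors="coerce")

# Made-up sample values, not taken from the dataset
sample = pd.Series(["$56.99", "$159.99", "$1,299.99", "$19.99 through $39.99", None])
print(parse_price(sample).tolist())
```

For a range, this sketch keeps the lower bound only; whether that is the right choice depends on how the price is used later.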

In [91]:
data.dtypes

Sub Category           string[python]
Price                         float64
Discount                       object
Rating                 string[python]
Title                  string[python]
Currency                       object
Feature                string[python]
Product Description    string[python]
dtype: object
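
Rating is kept as text here, although each value carries two numbers: the star score and the review count. If those are needed as quantitative variables, they could be extracted as sketched below. The function name `split_rating` and the output column names `stars` and `reviews` are hypothetical; the sample strings follow the format visible in the outputs above.

```python
import pandas as pd

# Named groups become the column names of the extracted frame
RATING_PATTERN = (
    r"Rated (?P<stars>\d+(?:\.\d+)?) out of 5 stars "
    r"based on (?P<reviews>\d+) reviews"
)

def split_rating(rating: pd.Series) -> pd.DataFrame:
    """Split the rating text into a numeric star score and a review count."""
    parts = rating.str.extract(RATING_PATTERN)
    return pd.DataFrame({
        "stars": pd.to_numeric(parts["stars"], errors="coerce"),
        "reviews": pd.to_numeric(parts["reviews"], errors="coerce"),
    })

sample = pd.Series([
    "Rated 4.3 out of 5 stars based on 265 reviews.",
    "Rated 5 out of 5 stars based on 1 reviews.",
    "No Reviews",
    None,
])
print(split_rating(sample))
```

Rows that read "No Reviews" or are missing come out as NaN in both columns, so they stay distinguishable from real scores.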

In [92]:
data.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake Certified Kosher OU-D...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,159.9,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count Individually wrapped Made in and Imp...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways 32 oz 2-Pack No Preserv...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake Certified Koshe...",A cake the dessert epicure will die for!To the...


## Data Normalization

In [93]:
def min_max_normalize(name: str) -> None:
    """Rescale column `name` of the global `data` frame to the range [0, 1] in place."""
    col = data[name]
    data[name] = (col - col.min()) / (col.max() - col.min())

In [94]:
min_max_normalize( "Price" ) 
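
Min-max normalization maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below is a variant that returns a new Series instead of modifying `data`, and that guards against a constant column, where max - min would be zero. The name `min_max_scaled` and the choice to map a constant column to 0.0 are assumptions, not part of the original notebook.

```python
import pandas as pd

def min_max_scaled(values: pd.Series) -> pd.Series:
    """Return a rescaled copy of `values` in [0, 1]; the input is left unchanged."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant column: avoid dividing by zero and map every value to 0.0
        return values * 0.0
    return (values - lo) / (hi - lo)

# Made-up prices for illustration
prices = pd.Series([10.0, 20.0, 40.0])
print(min_max_scaled(prices).tolist())
```

With this variant the normalized values can be stored in a separate column, which keeps the original prices available for later steps.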

## 6. Turn categorical variables into quantitative variables in Python. In addition to the code and outputs, explain every operation that you do in the above steps and explain everything that you do to import/read/scrape the data set.

### Using Label Encoding to Convert Categorical data to Numerical data

1. Inspect the current data

In [95]:
data.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,0.026566,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake Certified Kosher OU-D...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,0.07815,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,0.020551,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count Individually wrapped Made in and Imp...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,0.018045,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways 32 oz 2-Pack No Preserv...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,0.02807,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake Certified Koshe...",A cake the dessert epicure will die for!To the...


2. Check for null values

In [96]:
data.isna().sum()

Sub Category              0
Price                     3
Discount                  0
Rating                 1075
Title                     0
Currency                  5
Feature                  18
Product Description      42
dtype: int64

3. Remove the null values

In [97]:
data = data.dropna()

4. Confirm that no null values remain

In [98]:
data.isna().sum()

Sub Category           0
Price                  0
Discount               0
Rating                 0
Title                  0
Currency               0
Feature                0
Product Description    0
dtype: int64
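
Calling `dropna()` without arguments removes every row that has a null in any column. Since Rating is present in only 682 of the 1757 rows, at most 682 rows can survive this step. A less drastic option, sketched below on a made-up frame, is to drop rows only where an essential column is missing and to fill the rest with a placeholder. Treating Price as the essential column and using "No Reviews" as the fill value are assumptions for illustration.

```python
import pandas as pd

# Made-up frame standing in for the grocery data
toy = pd.DataFrame({
    "Price": [1.99, None, 3.49, 5.00],
    "Rating": [None, "No Reviews", None, "Rated 4 out of 5 stars based on 2 reviews."],
    "Title": ["a", "b", "c", "d"],
})

# Dropping on every column keeps only fully populated rows
strict = toy.dropna()

# Drop only rows without a Price, then fill missing Rating with a placeholder
lenient = toy.dropna(subset=["Price"]).fillna({"Rating": "No Reviews"})

print(len(toy), len(strict), len(lenient))
```

Here the strict version keeps one row out of four, while the lenient version keeps three.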

5. Select the non-numeric (categorical) columns

In [99]:
data_exclude_numeric = data.select_dtypes(exclude=np.number).columns

6. Categorical columns

In [100]:
data[data_exclude_numeric].head()

Unnamed: 0,Sub Category,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake Certified Kosher OU-D...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count Individually wrapped Made in and Imp...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways 32 oz 2-Pack No Preserv...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake Certified Koshe...",A cake the dessert epicure will die for!To the...


In [101]:
data_exclude_numeric

Index(['Sub Category', 'Discount', 'Rating', 'Title', 'Currency', 'Feature',
       'Product Description'],
      dtype='object')

7. Looping through _"data_exclude_numeric"_ to encode the columns

In [102]:
from sklearn.preprocessing import LabelEncoder

In [103]:
label_encoder = LabelEncoder()

In [104]:
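# Note: this is an alias, not a copy, so the encoding below also overwrites the columns of `data`
data_categorical = data
for i in data_exclude_numeric:
    # The single encoder is refit for every column, so it only remembers the last column's mapping
    data_categorical[i] = label_encoder.fit_transform(data[i])
print("Encoded data")
data_categorical.head()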

Encoded data


Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,0,0.026566,14,164,134,0,2,26
1,0,0.07815,14,469,472,0,457,147
2,0,0.020551,14,109,451,0,58,310
3,0,0.018045,14,431,129,0,273,472
4,0,0.02807,14,298,135,0,0,27


In [105]:
data_categorical

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,0,0.026566,14,164,134,0,2,26
1,0,0.078150,14,469,472,0,457,147
2,0,0.020551,14,109,451,0,58,310
3,0,0.018045,14,431,129,0,273,472
4,0,0.028070,14,298,135,0,0,27
...,...,...,...,...,...,...,...,...
1740,17,0.018045,14,184,178,0,45,375
1746,17,0.014536,14,354,439,0,378,400
1748,17,0.023058,14,0,203,0,80,203
1749,17,0.023058,14,0,205,0,81,202
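
Label encoding replaces each distinct value in a column with an integer code, assigned in sorted order of the values. If the original labels are needed again later, one option is to keep a separate fitted encoder per column and to encode a copy of the frame, as sketched below. The toy frame and the discount text "After $5 OFF" are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame standing in for the grocery data
toy = pd.DataFrame({
    "Sub Category": ["Snacks", "Bakery & Desserts", "Snacks"],
    "Discount": ["No Discount", "No Discount", "After $5 OFF"],
})

encoded = toy.copy()   # work on a copy so `toy` keeps its original labels
encoders = {}          # one fitted encoder per column
for column in toy.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(toy[column])

print(encoded["Sub Category"].tolist())

# Each stored encoder can map its codes back to the original labels
restored = encoders["Sub Category"].inverse_transform(encoded["Sub Category"])
print(list(restored))
```

Note that the integer codes imply an order that the categories do not really have, which matters for models that treat the codes as magnitudes.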


In [106]:
from sklearn.preprocessing import OneHotEncoder

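# scikit-learn 1.2 and later name this parameter sparse_output; older releases use sparse=False
one_hot_encoding = OneHotEncoder(sparse_output=False)
# `data` was already label-encoded above, so the categories seen here are integer codes
one_hotEncoded = one_hot_encoding.fit_transform(data[data_exclude_numeric])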



In [107]:
one_hotEncoded

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [108]:
data.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,0,0.026566,14,164,134,0,2,26
1,0,0.07815,14,469,472,0,457,147
2,0,0.020551,14,109,451,0,58,310
3,0,0.018045,14,431,129,0,273,472
4,0,0.02807,14,298,135,0,0,27


In [109]:
new_onehot_encoded_dataFrame = pd.DataFrame(one_hotEncoded,
                                            columns=one_hot_encoding.get_feature_names_out(data_exclude_numeric))

In [110]:
new_onehot_encoded_dataFrame.head()

Unnamed: 0,Sub Category_0,Sub Category_1,Sub Category_2,Sub Category_3,Sub Category_4,Sub Category_5,Sub Category_6,Sub Category_7,Sub Category_8,Sub Category_9,...,Product Description_551,Product Description_552,Product Description_553,Product Description_554,Product Description_555,Product Description_556,Product Description_557,Product Description_558,Product Description_559,Product Description_560
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [111]:
new_onehot_encoded_dataFrame.columns

Index(['Sub Category_0', 'Sub Category_1', 'Sub Category_2', 'Sub Category_3',
       'Sub Category_4', 'Sub Category_5', 'Sub Category_6', 'Sub Category_7',
       'Sub Category_8', 'Sub Category_9',
       ...
       'Product Description_551', 'Product Description_552',
       'Product Description_553', 'Product Description_554',
       'Product Description_555', 'Product Description_556',
       'Product Description_557', 'Product Description_558',
       'Product Description_559', 'Product Description_560'],
      dtype='object', length=2149)
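
One-hot encoding all seven non-numeric columns yields 2149 indicator columns, because free-text columns such as Title, Feature, and Product Description contribute one column per distinct value. A possible refinement, sketched below with `pd.get_dummies`, is to one-hot encode only the columns with few distinct values. The cut-off `MAX_LEVELS` and the toy frame are hypothetical and would need tuning for the real data.

```python
import pandas as pd

# Made-up frame standing in for the grocery data
toy = pd.DataFrame({
    "Sub Category": ["Snacks", "Bakery & Desserts", "Snacks"],
    "Title": ["Item A", "Item B", "Item C"],   # unique per row, so not a useful category
    "Price": [0.1, 0.5, 0.2],
})

MAX_LEVELS = 2  # hypothetical cut-off for how many distinct values still count as categorical
text_columns = toy.select_dtypes(exclude="number").columns
low_cardinality = [c for c in text_columns if toy[c].nunique() <= MAX_LEVELS]

# Only the selected columns are expanded; the others are passed through unchanged
one_hot = pd.get_dummies(toy, columns=low_cardinality, dtype=float)
print(one_hot.columns.tolist())
```

Because this works on the original labels rather than on integer codes, the new column names stay readable, for example `Sub Category_Snacks`.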