# Notebook on "Understanding Data" Section

This notebook discusses various ways to handle raw data before we can apply data science to it. Methods include, but are not limited to:

Data Inspection: Examine the dataset to understand its structure, format, and potential issues. Look for missing values, duplicates, and outliers. This step helps you identify the areas that need cleaning.
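
As a rough illustration of the inspection step, the sketch below runs a few common pandas checks on a small made-up DataFrame. The column names and values are hypothetical and stand in for whatever dataset is being inspected.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame({
    "name": ["A", "B", "B", None],
    "price": [100.0, 250.0, 250.0, 90.0],
})

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type of each column
print(df.describe())  # summary statistics for the numeric columns

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isnull().sum()
duplicate_rows = int(df.duplicated().sum())
print(missing_per_column)
print(duplicate_rows)
```

The same calls can be applied to any DataFrame loaded with `pd.read_csv`.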

Handling Missing Values:

Delete: Remove rows or columns with too many missing values or data that cannot be recovered.

Impute: Fill in missing values with reasonable estimates, such as using the mean, median, mode, or predictive models.
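
The two options above can be sketched as follows. The toy data, the 50% cutoff for dropping a column, and the choice of median and mode are all assumptions made for the example, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column.
df = pd.DataFrame({
    "rating": [4.8, np.nan, 4.6, 4.9],
    "city": ["Linz", "Wels", None, "Linz"],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Delete: drop columns where more than half of the values are missing.
threshold = 0.5  # assumed cutoff; tune it to the dataset
df_reduced = df.loc[:, df.isnull().mean() <= threshold]

# Impute: median for the numeric column, mode for the categorical one.
df_filled = df_reduced.copy()
df_filled["rating"] = df_filled["rating"].fillna(df_filled["rating"].median())
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])
```

Here `notes` is dropped because three of its four values are missing, while the single gaps in `rating` and `city` are filled.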


Handling Duplicates:

Identify and remove duplicate records or rows that may skew analysis.
De-duplicate data based on specific criteria, such as unique identifiers.
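
A minimal sketch of both approaches, assuming a hypothetical `customer_id` column acts as the unique identifier. The rows are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are exact copies,
# row 3 shares the identifier but differs in the company name.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1002, 1002],
    "company": ["Livetube", "Voonyx", "Voonyx", "Voonyx Inc"],
})

# Show every row that has an exact copy elsewhere in the frame.
exact_dupes = df[df.duplicated(keep=False)]

# Remove exact duplicate rows, keeping the first occurrence.
df_unique_rows = df.drop_duplicates()

# De-duplicate on the identifier only, keeping the first row per ID.
df_unique_ids = df.drop_duplicates(subset="customer_id", keep="first")
```

Which row to keep per identifier (`keep="first"` here) depends on the data, so it is worth checking the discarded rows before dropping them.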


Data Transformation:

Standardize formats: Ensure consistent formats for dates, times, currency, and other data types.

Normalize data: Scale numerical values to a common range.

Encoding: Convert categorical data into numerical values using techniques like one-hot encoding or label encoding.
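
The three transformations above might look like this in pandas. The data is hypothetical, min-max scaling is just one of several normalization choices, and category codes are used here as a simple form of label encoding.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "check_in": ["2020-07-15", "2020-08-01", "2020-12-24"],
    "price": [500.0, 750.0, 1000.0],
    "room_type": ["flat", "house", "flat"],
})

# Standardize formats: parse date strings into a datetime column.
df["check_in"] = pd.to_datetime(df["check_in"], format="%Y-%m-%d")

# Normalize: min-max scaling to the range [0, 1].
price_min, price_max = df["price"].min(), df["price"].max()
df["price_scaled"] = (df["price"] - price_min) / (price_max - price_min)

# Encoding: one-hot columns, and integer codes as a label encoding.
one_hot = pd.get_dummies(df["room_type"], prefix="room_type")
df["room_type_code"] = df["room_type"].astype("category").cat.codes
```

One-hot encoding avoids implying an order between categories, while integer codes are more compact but do suggest one.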


Handling Outliers:

Identify and assess outliers using statistical methods.

Decide whether to remove, transform, or keep outliers based on domain knowledge and the impact on analysis.
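
One common statistical method is the interquartile range (IQR) rule, sketched below on made-up prices. The 1.5 multiplier is a convention rather than a requirement, and flagged values still need a domain-knowledge check before anything is removed.

```python
import pandas as pd

# Hypothetical prices with one suspiciously large value.
prices = pd.Series([500, 520, 510, 530, 505, 4400], name="price")

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)
outliers = prices[is_outlier]

# One way to "transform" instead of removing: cap values at the bounds.
prices_capped = prices.clip(lower=lower, upper=upper)
```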


Data Validation:

Check for data consistency and integrity by validating against predefined rules or constraints.
Identify and correct data values that violate business rules.
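
A minimal sketch of rule-based validation. The column names follow the Airbnb table used in this notebook, but the rows and both rules (ratings between 0 and 5, non-negative prices) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows; the second and third break one rule each.
df = pd.DataFrame({
    "hotel_rating": [4.88, 5.30, 4.62],
    "discount_pric": [783, 524, -10],
})

# Each rule yields a boolean Series: True where the row passes.
rules = {
    "rating_in_range": df["hotel_rating"].between(0, 5),
    "price_non_negative": df["discount_pric"] >= 0,
}
checks = pd.DataFrame(rules)

# Rows that fail at least one rule.
violations = df[~checks.all(axis=1)]
print(violations)
```

Keeping one column per rule in `checks` makes it easy to see which rule each row violates.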


Data Error Correction:

Correct errors in the dataset, such as typos, inaccuracies, or inconsistencies.
Use domain knowledge, reference data, or automated algorithms to make corrections.
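
As one possible automated approach, the sketch below corrects typos by fuzzy-matching values against a reference list with the standard library's `difflib`. The reference list, the sample values, the helper name `correct`, and the 0.8 similarity cutoff are all assumptions for the example.

```python
import difflib

import pandas as pd

# Hypothetical reference data and raw values containing typos.
reference = ["Austria", "Germany"]
raw = pd.Series(["Austria", "Austira", "germany", "Germny"])


def correct(value, choices, cutoff=0.8):
    """Return the closest reference value, or the input if none is close."""
    match = difflib.get_close_matches(value.title(), choices, n=1, cutoff=cutoff)
    return match[0] if match else value


corrected = raw.map(lambda v: correct(v, reference))
print(corrected.tolist())
```

Fuzzy matching can also make wrong corrections, so it is safer to review the changed values, especially with a lower cutoff.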

Data Integration: If you have data from multiple sources, integrate them into a cohesive dataset. Ensure that common fields are matched correctly.
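
A minimal sketch of integrating two sources on a common field. The `orders` values are taken from the corporate table in this notebook, while the `contacts` table, its column names, and its values are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "Customer ID": [1001, 1002, 1005],
    "Invoice Total": [163465, 48322, 182446],
})

# Hypothetical second source: different key name, IDs stored as text.
contacts = pd.DataFrame({
    "customer_id": ["1001", "1002", "1009"],
    "region": ["North", "South", "East"],
})

# Align the key name and type so the common field matches correctly.
contacts = contacts.rename(columns={"customer_id": "Customer ID"})
contacts["Customer ID"] = contacts["Customer ID"].astype("int64")

# Outer join keeps unmatched rows; indicator shows where each row came from.
merged = orders.merge(
    contacts, on="Customer ID", how="outer",
    indicator=True, validate="one_to_one",
)
unmatched = merged[merged["_merge"] != "both"]
```

The `_merge` column makes unmatched keys visible, and `validate="one_to_one"` raises an error if either side has repeated keys.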

Data Formatting:

Ensure that data follows a consistent format, such as proper capitalization, date formats, and unit consistency.
Convert text data to a common case (e.g., lower or upper case).
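
The formatting steps above can be sketched as follows. The data is made up, and the `weight` column with mixed units is a hypothetical example of unit inconsistency.

```python
import pandas as pd

# Hypothetical toy data with inconsistent case, whitespace, and units.
df = pd.DataFrame({
    "company": ["  livetube", "VOONYX ", "LeEnTi"],
    "weight": ["1.5 kg", "750 g", "2 kg"],
})

# Common case: trim surrounding whitespace and lower-case the text.
df["company"] = df["company"].str.strip().str.lower()

# Unit consistency: split value and unit, then convert everything to kg.
parts = df["weight"].str.extract(r"^\s*(?P<value>[\d.]+)\s*(?P<unit>kg|g)\s*$")
divisor = parts["unit"].map({"kg": 1, "g": 1000})
df["weight_kg"] = parts["value"].astype(float) / divisor
```

Values that do not match the pattern come out as missing, which makes unexpected formats easy to spot.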


In [5]:
import pandas as pd
df_bnb = pd.read_csv('airbnb.csv')
df_cor = pd.read_csv('corporate-superhero.csv')
df_bnb

Unnamed: 0,hotel_name,Previous_price,discount_pric,description,hotel_rating,hotel_reviews,urls,No_gust,No_bath,No_bedroom,No_bed
0,Mountain Apartment,0,783,2 guests · 1 bedroom · 1 bed · 1 bath,4.88,153,/rooms/7965434?location=Austria&check_in=2020-...,2,1 bath,1 bedroom,· 1 bed
1,Ruhige Wohnung in Oftering nähe Linz Wels,0,524,6 guests · 2 bedrooms · 7 beds · 1 bath,4.88,214,/rooms/23823538?location=Austria&check_in=2020...,6,1 bath,2 bedrooms,· 2 bed
2,Apartment in the Old town of Steyr,0,653,3 guests · 1 bedroom · 2 beds · 1.5 baths,4.94,176,/rooms/24426446?location=Austria&check_in=2020...,3,1.5 baths,1 bedroom,· 1 bed
3,cozy house with picturesque garden,0,505,1 guest · 1 bedroom · 1 bed · 1 shared bath,4.79,80,/rooms/647295?location=Austria&check_in=2020-0...,1,1 shared bath,1 bedroom,· 1 bed
4,luxurios apartment ► Messe/Uno City,0,735,2 guests · 1 bedroom · 2 beds · 1 bath,4.62,384,/rooms/6299386?location=Austria&check_in=2020-...,2,1 bath,1 bedroom,· 1 bed
...,...,...,...,...,...,...,...,...,...,...,...
1195,Schönes Ferienhaus in ruhiger Lage,0,1908,6 guests · 3 bedrooms · 4 beds · 1.5 baths,4.66,56,/rooms/10170704?location=Germany&check_in=2020...,6,1.5 baths,3 bedrooms,· 3 bed
1196,Ferienwohnung am Iberg,0,3434,2 guests · 1 bedroom · 1 bed · 1 bath,4.81,54,/rooms/15574273?location=Germany&check_in=2020...,2,1 bath,1 bedroom,· 1 bed
1197,"Lovely Flat with BALCONY, VIEW and GARDEN",0,4419,4 guests · 1 bedroom · 1 bed · 1 bath,4.77,147,/rooms/2143228?location=Germany&check_in=2020-...,4,1 bath,1 bedroom,· 1 bed
1198,"Sauberes Appartment, zentral, erholsam, ruhig",2390,2079,3 guests · 1 bedroom · 2 beds · 1 shared bath,4.75,221,/rooms/13730181?location=Germany&check_in=2020...,3,1 shared bath,1 bedroom,· 1 bed


In [4]:
df_cor

Unnamed: 0,Customer ID,Email Address,Company,Order Number,Invoice Total
0,1001,weeklychallenge@123abc.co.ayx,Livetube,60505,163465
1,1002,weeklychalleng.e@123abc.co.ayx,Voonyx,37808,48322
2,1005,weeklychallen.ge@123abc.co.ayx,Leenti,57955,182446
3,1006,weeklychallen.g.e@123abc.co.ayx,Avaveo,61919,137167
4,1008,weeklychalle.nge@123abc.co.ayx,Leexo,68645,18788
...,...,...,...,...,...
173,1383,w.eeklycha.ll.e.ng.e@123abc.co.ayx,Shufflester,41520,336
174,1387,w.eeklycha.ll.e.n.ge@123abc.co.ayx,Flashdog,41595,76310
175,1393,w.eeklycha.ll.e.n.g.e@123abc.co.ayx,Chatterpoint,51523,150236
176,1398,w.eeklycha.l.lenge@123abc.co.ayx,Topiczoom,12745,181384


In [12]:
# checking for nulls in our dataset
df_cor.isnull().sum()

Customer ID      0
Email Address    0
Company          0
Order Number     0
Invoice Total    0
dtype: int64

In [14]:
# sorting in descending order by 'Customer ID' (matches the output shown below)
df_cor = df_cor.sort_values(by='Customer ID', ascending=False)
df_cor

Unnamed: 0,Customer ID,Email Address,Company,Order Number,Invoice Total
177,1399,w.eeklycha.l.leng.e@123abc.co.ayx,Aimbo,44911,113033
176,1398,w.eeklycha.l.lenge@123abc.co.ayx,Topiczoom,12745,181384
175,1393,w.eeklycha.ll.e.n.g.e@123abc.co.ayx,Chatterpoint,51523,150236
174,1387,w.eeklycha.ll.e.n.ge@123abc.co.ayx,Flashdog,41595,76310
173,1383,w.eeklycha.ll.e.ng.e@123abc.co.ayx,Shufflester,41520,336
...,...,...,...,...,...
4,1008,weeklychalle.nge@123abc.co.ayx,Leexo,68645,18788
3,1006,weeklychallen.g.e@123abc.co.ayx,Avaveo,61919,137167
2,1005,weeklychallen.ge@123abc.co.ayx,Leenti,57955,182446
1,1002,weeklychalleng.e@123abc.co.ayx,Voonyx,37808,48322


In [11]:
# checking for nulls in the Airbnb dataset
df_bnb.isnull().sum()

hotel_name         0
Previous_price     0
discount_pric      0
description        0
hotel_rating       0
hotel_reviews      0
urls               0
No_gust            0
No_bath           15
No_bedroom         0
No_bed             0
dtype: int64
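
Since `No_bath` is the only column with missing values, a natural next step is to look at the affected rows and turn the text into a number. The sketch below does this on a small stand-in frame. The first two rows copy the format seen in the table above, the third row is a hypothetical example of a missing entry, and `bath_count` is a column name introduced for this example. We have not checked why the 15 values are missing in the real data.

```python
import numpy as np
import pandas as pd

# Stand-in for df_bnb; the third row is hypothetical.
sample = pd.DataFrame({
    "description": [
        "2 guests · 1 bedroom · 1 bed · 1 bath",
        "3 guests · 1 bedroom · 2 beds · 1.5 baths",
        "2 guests · 1 bedroom · 1 bed",
    ],
    "No_bath": ["1 bath", "1.5 baths", np.nan],
})

# Inspect the rows where No_bath is missing before deciding what to do.
missing_bath = sample[sample["No_bath"].isnull()]

# Pull the leading number out of strings such as "1.5 baths".
sample["bath_count"] = (
    sample["No_bath"]
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)
```

Rows with no bath information stay missing in `bath_count`, so they can be dropped or imputed later as a deliberate choice.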