# Netflix Data Cleaning & Wrangling Part 2

### This Script Contains the Following Points:
#### 1. Importing Libraries & Data
#### 2. Check Data for Additional Cleaning
#### 3. Deriving New Column for Region
#### 4. Export Updated Dataframe

## 1. Importing Libraries & Data

In [5]:
#Import libraries
import pandas as pd
import numpy as np
import os

#Create Folder path to data
path = r'/Users/C SaiVishwanath/Desktop/2024 Projects/Netflix'

#Import Netflix data as 'df'
df = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'cleaned_with_days_since_join_netflix.csv'))

## 2. Check Data for Additional Cleaning

In [7]:
# Check output
df.head(5)

| | User ID | Subscription Type | Monthly Revenue | Join Date | Last Payment Date | Country | Age | Gender | Device | Days Since Join |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Basic | 10 | 1/15/22 | 10/6/23 | United States | 28 | Male | Smartphone | 629 |
| 1 | 2 | Premium | 15 | 5/9/21 | 6/22/23 | Canada | 35 | Female | Tablet | 774 |
| 2 | 211 | Premium | 15 | 11/15/22 | 6/27/23 | United States | 40 | Male | Tablet | 224 |
| 3 | 4 | Standard | 12 | 10/7/22 | 6/26/23 | Australia | 51 | Female | Laptop | 262 |
| 4 | 5 | Basic | 10 | 1/5/23 | 6/28/23 | Germany | 33 | Male | Smartphone | 174 |


In [8]:
# User ID column may contain an error: the 3rd row has 211 where 3 would be expected
# Count distinct values in User ID; a result equal to the row count means no ID repeats
df['User ID'].nunique()

2500

In [9]:
# No issue with User ID column found
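
An out-of-sequence ID such as 211 is only a problem if it also repeats elsewhere. The following is a minimal sketch of an explicit uniqueness check, using a small hypothetical frame named `sample` in place of `df`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df: IDs are out of order but none repeats
sample = pd.DataFrame({'User ID': [1, 2, 211, 4, 5]})

# True only when every value in the column appears exactly once
ids_are_unique = sample['User ID'].is_unique

# Rows whose ID occurs more than once (empty when the column is unique)
repeated = sample[sample['User ID'].duplicated(keep=False)]

print(ids_are_unique, len(repeated))
```

Applied to the real data, `df['User ID'].is_unique` would give a direct yes/no answer, and the `duplicated(keep=False)` filter would list any offending rows.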

In [10]:
# Check Shape
df.shape

(2500, 10)

In [11]:
# Printing stats
df.describe()

| | User ID | Monthly Revenue | Age | Days Since Join |
|---|---|---|---|---|
| count | 2500.0 | 2500.0 | 2500.0 | 2500.0 |
| mean | 1250.5 | 12.5084 | 38.7956 | 327.3932 |
| std | 721.83216 | 1.686851 | 7.171778 | 115.818714 |
| min | 1.0 | 10.0 | 26.0 | 8.0 |
| 25% | 625.75 | 11.0 | 32.0 | 249.0 |
| 50% | 1250.5 | 12.0 | 39.0 | 330.0 |
| 75% | 1875.25 | 14.0 | 45.0 | 402.0 |
| max | 2500.0 | 15.0 | 51.0 | 776.0 |


In [12]:
# Print Column data types
df.dtypes

User ID               int64
Subscription Type    object
Monthly Revenue       int64
Join Date            object
Last Payment Date    object
Country              object
Age                   int64
Gender               object
Device               object
Days Since Join       int64
dtype: object

In [13]:
# User ID is an identifier, not a quantity, so it does not need to be an integer
# Change its data type to string
df['User ID'] = df['User ID'].astype(str)

In [14]:
# Check data types for change
df.dtypes

User ID              object
Subscription Type    object
Monthly Revenue       int64
Join Date            object
Last Payment Date    object
Country              object
Age                   int64
Gender               object
Device               object
Days Since Join       int64
dtype: object

In [15]:
# User ID column successfully changed from integer
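
The dtypes output also shows `Join Date` and `Last Payment Date` stored as `object` (string) columns. If date arithmetic is needed later, they could be parsed with `pd.to_datetime`. The sketch below is not part of the original notebook; it uses a hypothetical `sample` frame built from the first three rows of `df.head()` and assumes the dates follow a month/day/two-digit-year layout and that `Days Since Join` is the gap between the two dates.

```python
import pandas as pd

# Hypothetical sample built from the first three rows shown by df.head()
sample = pd.DataFrame({
    'Join Date': ['1/15/22', '5/9/21', '11/15/22'],
    'Last Payment Date': ['10/6/23', '6/22/23', '6/27/23'],
})

# Assumption: dates are written as month/day/two-digit year
for col in ['Join Date', 'Last Payment Date']:
    sample[col] = pd.to_datetime(sample[col], format='%m/%d/%y')

# Assumption: 'Days Since Join' is Last Payment Date minus Join Date, in days
sample['Days Since Join'] = (sample['Last Payment Date'] - sample['Join Date']).dt.days

print(sample['Days Since Join'].tolist())  # [629, 774, 224]
```

The recomputed values agree with the `Days Since Join` column displayed for those rows, which supports the assumed date format.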

In [16]:
# Print basic Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   User ID            2500 non-null   object
 1   Subscription Type  2500 non-null   object
 2   Monthly Revenue    2500 non-null   int64 
 3   Join Date          2500 non-null   object
 4   Last Payment Date  2500 non-null   object
 5   Country            2500 non-null   object
 6   Age                2500 non-null   int64 
 7   Gender             2500 non-null   object
 8   Device             2500 non-null   object
 9   Days Since Join    2500 non-null   int64 
dtypes: int64(3), object(7)
memory usage: 195.4+ KB


In [17]:
# Check for missing values
df.isnull().sum()

User ID              0
Subscription Type    0
Monthly Revenue      0
Join Date            0
Last Payment Date    0
Country              0
Age                  0
Gender               0
Device               0
Days Since Join      0
dtype: int64

In [18]:
# No missing values

In [19]:
# Check for duplicates
df.duplicated().sum()

0

In [20]:
# No duplicates found

In [22]:
# Check unique values in the Subscription column 
df['Subscription Type'].value_counts()

Subscription Type
Basic       999
Standard    768
Premium     733
Name: count, dtype: int64

In [23]:
# Check unique values in Monthly Revenue column
df['Monthly Revenue'].value_counts()

Monthly Revenue
12    455
14    431
13    418
10    409
15    399
11    388
Name: count, dtype: int64

In [24]:
# Check unique values in Country column
df['Country'].value_counts()

Country
United States     451
Spain             451
Canada            317
Australia         183
Germany           183
Mexico            183
Brazil            183
Italy             183
France            183
United Kingdom    183
Name: count, dtype: int64

## 3. Deriving New Column for Region

In [45]:
# Create 'Region' column to group countries by region of the world
# Dictionary to map countries to regions
region_map = {
    'United States': 'North America',
    'Canada': 'North America',
    'Mexico': 'North America',
    'Spain': 'Europe',
    'Germany': 'Europe',
    'Italy': 'Europe',
    'France': 'Europe',
    'United Kingdom': 'Europe',
    'Brazil': 'South America',
    'Australia': 'Oceania'
}

# Create the 'Region' column by mapping the countries to their regions
df['Region'] = df['Country'].map(region_map)
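
`Series.map` with a dictionary returns `NaN` for any value that has no matching key, so a misspelled or newly added country would silently end up without a region. A minimal sketch of a coverage check follows; the `sample` frame and the shortened mapping are hypothetical, and 'Japan' is left out on purpose to show the failure case.

```python
import pandas as pd

# Hypothetical sample; 'Japan' is deliberately absent from the mapping
sample = pd.DataFrame({'Country': ['Canada', 'Brazil', 'Japan']})
sample_region_map = {'Canada': 'North America', 'Brazil': 'South America'}

sample['Region'] = sample['Country'].map(sample_region_map)

# Countries that map() could not find are left as NaN in 'Region'
unmapped = sample.loc[sample['Region'].isna(), 'Country'].unique().tolist()

print(unmapped)  # ['Japan']
```

Running the same `isna()` filter on `df` after the mapping step should return an empty list if every country in the data has an entry in `region_map`.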

In [47]:
# Check for new column
df.head(5)

| | User ID | Subscription Type | Monthly Revenue | Join Date | Last Payment Date | Country | Age | Gender | Device | Days Since Join | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Basic | 10 | 1/15/22 | 10/6/23 | United States | 28 | Male | Smartphone | 629 | North America |
| 1 | 2 | Premium | 15 | 5/9/21 | 6/22/23 | Canada | 35 | Female | Tablet | 774 | North America |
| 2 | 211 | Premium | 15 | 11/15/22 | 6/27/23 | United States | 40 | Male | Tablet | 224 | North America |
| 3 | 4 | Standard | 12 | 10/7/22 | 6/26/23 | Australia | 51 | Female | Laptop | 262 | Oceania |
| 4 | 5 | Basic | 10 | 1/5/23 | 6/28/23 | Germany | 33 | Male | Smartphone | 174 | Europe |


## 4. Export Updated Dataframe

In [53]:
# Export updated dataframe to Prepared Data folder 
df.to_csv(os.path.join(path, '02 Data','Prepared Data', 'cleaned_derived_netflix_userbase.csv'))
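
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. Whether later notebooks in this project rely on that column is not known, so the following is only a sketch of the difference, using a hypothetical two-row `sample` frame and in-memory buffers instead of the project folder.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for df
sample = pd.DataFrame({'User ID': ['1', '2'], 'Region': ['North America', 'Europe']})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)   # ['Unnamed: 0', 'User ID', 'Region']
print(cols_no_index)  # ['User ID', 'Region']
```

If the extra column is unwanted, passing `index=False` to the export call above would avoid it.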