<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/1_Basics/25_Pandas_Cleaning.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Cleaning

Load in the data.

In [1]:
# Install datasets Library (if not already installed)
# !pip install datasets

# Importing Libraries
import pandas as pd
from datasets import load_dataset

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()



## Date and Time

### Datetime

#### Notes

* `pd.to_datetime()`: Convert argument to datetime.
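
To make the note above concrete, here is a minimal sketch of `pd.to_datetime()` on a tiny Series. The sample strings and the variable names `raw` and `parsed` are made up for illustration and are not part of the course dataset. It also shows what `errors='coerce'` does with a value that cannot be parsed.

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately not a valid date.
raw = pd.Series(['2023-12-05 10:30:00', '2023-08-20 08:15:00', 'not a date'])

# errors='coerce' converts unparseable values to NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```

Without `errors='coerce'`, the invalid string would raise an exception, so coercing is a convenient way to find bad values by counting the resulting `NaT` entries.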

#### Example

In our DataFrame, the `job_posted_date` column is stored as a string, not as a datetime. First, let's convert it to a datetime.

We'll then use `info()` to confirm that the data type changed from a string (`object`) to `datetime64`.

In [2]:
# Convert 'job_posted_date' to datetime without specifying the exact format
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'], errors='coerce')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 787686 entries, 0 to 787685
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   job_title_short        787686 non-null  object        
 1   job_title              787685 non-null  object        
 2   job_location           786646 non-null  object        
 3   job_via                787679 non-null  object        
 4   job_schedule_type      774976 non-null  object        
 5   job_work_from_home     787686 non-null  bool          
 6   search_location        787686 non-null  object        
 7   job_posted_date        787686 non-null  datetime64[ns]
 8   job_no_degree_mention  787686 non-null  bool          
 9   job_health_insurance   787686 non-null  bool          
 10  job_country            787633 non-null  object        
 11  salary_rate            33073 non-null   object        
 12  salary_year_avg        22026 non-null   floa

### Date

#### Notes

* `dt`: accessor that exposes the specialized methods and properties used to work with datetime data in a pandas Series. 
* `date`: extracts the date component (without the time) from each datetime in the Series. 
* Combine the two as `dt.date` on a datetime Series.
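
As a rough illustration of the `dt` accessor described above, the sketch below uses a small made-up Series (the name `stamps` and its values are assumptions for the example). It also shows that `dt.date` returns plain Python `date` objects, so the resulting Series has `object` dtype rather than a datetime dtype.

```python
import pandas as pd

# Hypothetical timestamps for illustration only.
stamps = pd.Series(pd.to_datetime(['2023-12-05 10:30:00', '2023-08-20 08:15:00']))

dates = stamps.dt.date    # Python datetime.date objects (object dtype)
months = stamps.dt.month  # integer month numbers

print(dates)
print(months)
```

Because the result is no longer a datetime column, the `dt` accessor cannot be used on `dates` afterwards; keep the original column if later steps need datetime operations.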

#### Example

Now let's turn it from a datetime to a date using `dt.date`.

In [3]:
df['job_posted_date'] = df['job_posted_date'].dt.date

df.head()

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Data Analyst,Data Analytics,"Monterrey, Nuevo Leon, Mexico",via BeBee,Part-time,False,Mexico,2023-12-05,False,False,Mexico,,,,2U Bootcamps Instructional Engagement,"['go', 'python', 'mongodb', 'mongodb', 'css', ...","{'analyst_tools': ['tableau'], 'databases': ['..."
1,Data Scientist,Data Scientist Intern,"Lisbon, Portugal",via Empregos Trabajo.org,Full-time,False,Portugal,2023-08-20,False,False,Portugal,,,,Nokia,"['sql', 'python', 'sql server', 'oracle', 'azu...","{'analyst_tools': ['sap'], 'cloud': ['oracle',..."
2,Data Analyst,"Manager, Data Analytics","Guanacaste Province, Lagunilla, Costa Rica",via BeBee Costa Rica,Full-time,False,Costa Rica,2023-11-21,False,False,Costa Rica,,,,Thermo Fisher Scientific,,
3,Data Engineer,Data Engineer,"Lambayeque, Peru",via BeBee Perú,Full-time,False,Peru,2023-11-21,True,False,Peru,,,,Emprego,,
4,Data Analyst,Technical Data Analyst,"Fairfax, VA",via Indeed,Contractor,False,"New York, United States",2023-12-20,True,False,United States,,,,Info Origin Inc.,"['sql', 'python', 'jira']","{'async': ['jira'], 'programming': ['sql', 'py..."


#### Note: From here on, whenever we load the data we'll automatically convert the `job_posted_date` column to a datetime object.

## Sorting Values

### Notes

* `sort_values()` sorts a DataFrame (or a Series) in ascending or descending order by the values in one or more columns. 
* Typically you sort by a single column or a short list of columns.
* Parameters: 
    * `by` - column name or list of column names to sort by
    * `ascending` - boolean or list of booleans (default `True`); use `False` to sort in descending order
    * `inplace` - whether to modify the DataFrame in place or return a new one
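
The parameters above can be combined. The following is a small sketch, using a made-up DataFrame (`jobs`, with hypothetical `title` and `posted` columns), that sorts by two columns with a different direction for each.

```python
import pandas as pd

# Hypothetical data for illustration only.
jobs = pd.DataFrame({
    'title': ['Data Analyst', 'Data Engineer', 'Data Analyst', 'Data Scientist'],
    'posted': pd.to_datetime(['2023-03-01', '2023-01-15', '2023-01-15', '2023-03-01']),
})

# Newest postings first; ties on date are broken alphabetically by title.
sorted_jobs = jobs.sort_values(by=['posted', 'title'], ascending=[False, True])

print(sorted_jobs)
```

Since `inplace` is left at its default here, `jobs` keeps its original order and `sorted_jobs` is a new DataFrame.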

### Example

Let's sort our DataFrame by `job_posted_date` in descending order (most recent postings first). 

In [4]:
df.sort_values(by='job_posted_date', ascending=False, inplace=True)
df

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
287190,Data Engineer,Data Engineer H/F H/F,France,via BeBee,Full-time,False,France,2023-12-31,False,False,France,,,,Bolloré Group,"['python', 'azure', 'aws', 'gcp', 'databricks'...","{'analyst_tools': ['tableau', 'cognos', 'power..."
770443,Data Scientist,Data Scientist,"Chaville, France",via BeBee,Full-time,False,France,2023-12-31,False,False,France,,,,DASSAULT SYSTEMES,"['python', 'sql', 'power bi', 'tableau']","{'analyst_tools': ['power bi', 'tableau'], 'pr..."
770111,Data Scientist,(Junior) Analyst - Website/Social Media*,"Frankfurt, Germany",via XING,Full-time,False,Germany,2023-12-31,True,False,Germany,,,,Nintendo of Europe GmbH,"['sql', 'html', 'css', 'excel', 'tableau']","{'analyst_tools': ['excel', 'tableau'], 'progr..."
465217,Data Engineer,Work From Home Data Platform Engineer,"Risaralda, Caldas, Colombia",via Sercanto,Full-time,False,Colombia,2023-12-31,True,False,Colombia,,,,Bairesdev S.a.,"['python', 'scala', 'java', 'nosql', 'kafka', ...","{'libraries': ['kafka', 'airflow'], 'programmi..."
449221,Data Analyst,وظائف Data Analyst (Part time) – الجيزة,"Giza, El Omraniya, Egypt",via وظائف,Part-time,False,Egypt,2023-12-31,True,False,Egypt,,,,شركة,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
704528,Data Scientist,Data Scientist 2,"Atlanta, GA",via Trabajo.org,Full-time,False,"Florida, United States",2022-12-31,False,True,United States,,,,UnitedHealth Group,"['python', 'sql', 'hugging face', 'pytorch', '...","{'analyst_tools': ['excel'], 'libraries': ['hu..."
671315,Data Scientist,Data Quality Assessment,"Nairobi, Kenya",via BeBee Kenya,Full-time,False,Kenya,2022-12-31,False,False,Kenya,,,,SoCha,,
621134,Data Analyst,Data Analyst,"West Chicago, IL",via LinkedIn,Full-time,False,"Illinois, United States",2022-12-31,False,False,United States,,,,Epsilon,"['sas', 'sas', 'sql', 'r', 'python', 'hadoop',...","{'analyst_tools': ['sas', 'word', 'excel', 'po..."
421621,Data Scientist,Big Data Systems Engineer II | Cape Town,"Cape Town, South Africa",via Trabajo.org,Full-time,False,South Africa,2022-12-31,True,False,South Africa,,,,Progressive Edge,"['azure', 'linux', 'terraform']","{'cloud': ['azure'], 'os': ['linux'], 'other':..."


## Adding a Column

### Notes

- To create a column, assign to it with the bracket syntax: `df['column_name'] = values`. Attribute-style assignment (`df.column_name = values`) does not create a new column.
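
Here is a minimal sketch of adding a column on a tiny made-up DataFrame. The name `jobs` and the new column `is_analyst_flag` are hypothetical. It splits the work into two steps so the intermediate boolean Series is visible.

```python
import pandas as pd

# Hypothetical data for illustration only.
jobs = pd.DataFrame({'job_title_short': ['Data Analyst', 'Data Engineer', 'Data Analyst']})

# Step 1: the comparison produces a boolean Series (True/False per row).
is_analyst = jobs['job_title_short'] == 'Data Analyst'

# Step 2: astype(int) maps True to 1 and False to 0; assigning creates the column.
jobs['is_analyst_flag'] = is_analyst.astype(int)

print(jobs)
```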

### Example

Here we create a new column called 'Is Data Analyst'. The comparison `df.job_title_short == 'Data Analyst'` yields `True` or `False` for each row, and `astype(int)` converts those values to 1 (the title is 'Data Analyst') or 0 (it is not).

In [5]:
df['Is Data Analyst'] = (df.job_title_short == 'Data Analyst').astype(int)

Let's view this new column we created.

In [6]:
df['Is Data Analyst']

287190    0
770443    0
770111    0
465217    0
449221    1
         ..
704528    0
671315    0
621134    1
421621    0
481817    0
Name: Is Data Analyst, Length: 787686, dtype: int64

Did this work? Let's look at the rows where this column is greater than 0, which means it equals 1 (the condition was true). We can use the row filtering we learned in the last section.

In [7]:
df[df['Is Data Analyst'] > 0]

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills,Is Data Analyst
449221,Data Analyst,وظائف Data Analyst (Part time) – الجيزة,"Giza, El Omraniya, Egypt",via وظائف,Part-time,False,Egypt,2023-12-31,True,False,Egypt,,,,شركة,,,1
443614,Data Analyst,Engineer ii data analyst hybrid,"Aguadilla, Puerto Rico",via Sercanto,Full-time,False,Puerto Rico,2023-12-31,True,False,Puerto Rico,,,,Jobzem (2497612),,,1
605365,Data Analyst,Data Analyst at United Nations Environment Pro...,"Nairobi, Kenya",via BeBee Kenya,Full-time,False,Kenya,2023-12-31,False,False,Kenya,,,,United Nations Environment Programme (UNEP),"['sql', 'python', 'qlik', 'tableau', 'git']","{'analyst_tools': ['qlik', 'tableau'], 'other'...",1
776838,Data Analyst,Data Analyst,Ras Al-Khaimah - Ras al Khaimah - United Arab ...,via BeBee,Full-time,False,United Arab Emirates,2023-12-31,False,False,United Arab Emirates,,,,Work corp,"['sql', 'python', 'tableau', 'power bi']","{'analyst_tools': ['tableau', 'power bi'], 'pr...",1
776816,Data Analyst,Technical Data Analyst,"Manila, Metro Manila, Philippines",via Ai-Jobs.net,Full-time,False,Philippines,2023-12-31,False,False,Philippines,year,89204.0,,SupportNinja,"['windows', 'excel', 'jira']","{'analyst_tools': ['excel'], 'async': ['jira']...",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
512850,Data Analyst,Junior Data Analyst,Australia,via Laimoon.com,Contractor,False,Australia,2022-12-31,True,False,Australia,,,,4C Strategies,"['go', 'qlik']","{'analyst_tools': ['qlik'], 'programming': ['g...",1
690197,Data Analyst,Insurance Data Analyst,Hong Kong,via BeBee 香港,Full-time,False,Hong Kong,2022-12-31,False,False,Hong Kong,,,,10Life Group Limited,"['python', 'sql', 'vba', 'excel', 'power bi', ...","{'analyst_tools': ['excel', 'power bi', 'table...",1
604299,Data Analyst,Data Analyst / Data Steward - Real Estate (m/w...,"Cologne, Germany",via My ArkLaMiss Jobs,Full-time,False,Germany,2022-12-31,True,False,Germany,,,,BNP Paribas Real Estate,,,1
588085,Data Analyst,Data Analyst,"Dublin, Ireland",via BeBee Ireland,Full-time,False,Ireland,2022-12-31,True,False,Ireland,,,,Allied Irish Bank,['gdpr'],{'libraries': ['gdpr']},1


## Dropping Data

### Notes

* Use `drop()` if you want to drop (delete) either a column or a row in your DataFrame. 
* The syntax is:
    * Drop column: `df.drop('column_name', axis=1)`
    * Drop row: `df.drop(index_name, axis=0)`
* To drop multiple columns or rows, pass a list: 
    * Drop multiple columns: `df.drop(['column_name1', 'column_name2'], axis=1)`
    * Drop multiple rows: `df.drop([index_name1, index_name2], axis=0)`
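
The syntax above can be tried on a small made-up DataFrame. In this sketch, `frame` and its labels are hypothetical. The last line uses the `columns=` keyword, which pandas accepts as an alternative to `axis=1`.

```python
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
    index=['x', 'y', 'z'],
)

without_cols = frame.drop(['b', 'c'], axis=1)   # drop two columns
without_rows = frame.drop(['x', 'z'], axis=0)   # drop two rows by index label
same_as_cols = frame.drop(columns=['b', 'c'])   # keyword form of the column drop

print(without_cols)
print(without_rows)
```

None of these calls use `inplace=True`, so `frame` itself is left unchanged.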

### Examples

Let's drop the column `'salary_hour_avg'`. We use `axis=1` because we're dropping a column.

In [8]:
df.drop('salary_hour_avg', axis = 1, inplace=True)

Inspecting the columns available now:

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 787686 entries, 287190 to 481817
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   job_title_short        787686 non-null  object 
 1   job_title              787685 non-null  object 
 2   job_location           786646 non-null  object 
 3   job_via                787679 non-null  object 
 4   job_schedule_type      774976 non-null  object 
 5   job_work_from_home     787686 non-null  bool   
 6   search_location        787686 non-null  object 
 7   job_posted_date        787686 non-null  object 
 8   job_no_degree_mention  787686 non-null  bool   
 9   job_health_insurance   787686 non-null  bool   
 10  job_country            787633 non-null  object 
 11  salary_rate            33073 non-null   object 
 12  salary_year_avg        22026 non-null   float64
 13  company_name           787668 non-null  object 
 14  job_skills             670364 non-nu

This column is now removed!

## Remove NA

### Notes

* To remove rows that contain empty cells use `dropna()`. 
* By default `dropna()` returns a *new* DataFrame and leaves the original unchanged; pass `inplace=True` to modify the original instead. 
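
As a quick sketch of the behavior described above, the example below uses a made-up DataFrame (`pay`, with hypothetical values) to compare dropping rows with any missing value against dropping only rows where one specific column is missing, via the `subset` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
pay = pd.DataFrame({
    'company': ['A', 'B', None, 'D'],
    'salary_year_avg': [89204.0, np.nan, 115000.0, np.nan],
})

any_missing = pay.dropna()                            # drops rows with any missing cell
salary_only = pay.dropna(subset=['salary_year_avg'])  # only checks the salary column

print(any_missing)
print(salary_only)
```

Using `subset` keeps rows that are missing values in other columns, which matters when only one column is needed for the analysis.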

### Example

Let's clean up our `salary_year_avg` column by removing the rows where its value is `NaN`.

In [10]:
df.salary_year_avg.head()

287190   NaN
770443   NaN
770111   NaN
465217   NaN
449221   NaN
Name: salary_year_avg, dtype: float64

In [11]:
df.dropna(subset=['salary_year_avg'], inplace=True)

In [12]:
df.salary_year_avg.head()

776816     89204.0
290804    191000.0
464653    115000.0
141327    235847.0
416181    210000.0
Name: salary_year_avg, dtype: float64