<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/2_Advanced/03_Pandas_Data_Management.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Data Management

Load data.

In [1]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

# DataFrame Copy
df_original = df.copy()

## Copy

Recall from the last lesson that we filled in missing salary values with the median salary.

Here let's make a new dataframe `df_altered` and only make changes to it.

In [2]:
# Create new dataframe
df_altered = df_original

df_altered.loc[:5,'salary_year_avg']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: salary_year_avg, dtype: float64

Let's fill in missing values with the median value.

In [3]:
# Calculating the median salary
median_salary = df_altered['salary_year_avg'].median()

# Filling the missing values with the median salary
df_altered['salary_year_avg'] = df_altered.loc[:,'salary_year_avg'].fillna(median_salary)

In [4]:
df_altered['salary_year_avg'] = df_altered['salary_year_avg'].fillna(median_salary)

Now let's inspect the altered DataFrame.

In [5]:
df_altered.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

That worked as expected.

But what about the original?

In [6]:
df_original.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

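Hold up! How the heck did `df_original` get altered?

Well, both variables, `df_original` and `df_altered`, reference the same DataFrame. The assignment `df_altered = df_original` does not copy any data; it only gives the existing object a second name.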

In [7]:
print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('Are the two dataframes the same? ', id(df_original) == id(df_altered))

ID of df_original:                6578811792
ID of df_altered:                 6578811792
Are the two dataframes the same?  True
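
As a smaller illustration of the same behavior, here is a minimal sketch on a toy DataFrame. The names `toy_original` and `toy_alias` and the salary values are made up for this example and are not part of the course dataset. It uses Python's `is` operator, which checks object identity directly instead of comparing `id()` values.

```python
import pandas as pd

# Toy data standing in for the job postings (values are made up)
toy_original = pd.DataFrame(
    {'salary_year_avg': [float('nan'), 90000.0, float('nan')]}
)

# Plain assignment binds a second name to the same object; no data is copied
toy_alias = toy_original

# Fill the missing values through the alias
toy_alias['salary_year_avg'] = toy_alias['salary_year_avg'].fillna(0)

# Identity check: do both names point at the same object?
same_object = toy_alias is toy_original

# Count missing values as seen through the "original" name
missing_in_original = int(toy_original['salary_year_avg'].isna().sum())

print('Same object?          ', same_object)
print('NaNs left in original:', missing_in_original)
```

Because both names refer to one object, the fill applied through `toy_alias` is also visible through `toy_original`.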


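Instead, we can use the `.copy()` method.

- `copy()`: Make a copy of a DataFrame's indices and data (a deep copy by default), so changes to the copy do not affect the original.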

In [8]:
df_original = df.copy()
df_altered = df_original.copy()

print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('Are the two dataframes the same? ', id(df_original) == id(df_altered))

ID of df_original:                6974615312
ID of df_altered:                 6974615440
Are the two dataframes the same?  False


Now when we do this same operation:

In [9]:
# Calculating the median salary
median_salary = df_altered['salary_year_avg'].median()

# Filling the missing values with the median salary
df_altered['salary_year_avg'] = df_altered['salary_year_avg'].fillna(median_salary)

df_altered.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

The original dataframe doesn't get altered!

In [10]:
df_original.loc[:5,'salary_year_avg']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: salary_year_avg, dtype: float64
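
Looking at the first six rows is only a spot check. A minimal sketch of a stronger check, assuming a toy DataFrame with made-up values (`toy_original` and `toy_copy` are hypothetical names), is to count the missing values in the whole column of both DataFrames after the fill.

```python
import pandas as pd

# Toy data standing in for the job postings (values are made up)
toy_original = pd.DataFrame(
    {'salary_year_avg': [float('nan'), 90000.0, float('nan')]}
)

# copy() makes a deep copy by default, so the data is duplicated
toy_copy = toy_original.copy()

# Fill missing values in the copy only
toy_median = toy_copy['salary_year_avg'].median()
toy_copy['salary_year_avg'] = toy_copy['salary_year_avg'].fillna(toy_median)

# Count missing values across the entire column in each DataFrame
missing_in_copy = int(toy_copy['salary_year_avg'].isna().sum())
missing_in_original = int(toy_original['salary_year_avg'].isna().sum())

print('NaNs in copy:    ', missing_in_copy)
print('NaNs in original:', missing_in_original)
```

The same two `isna().sum()` calls could be run on `df_altered` and `df_original` to confirm that only the copy was filled.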

Now that we've created a copy of our data, we can start our analysis. With a large dataset, it often helps to work on a smaller subset first to keep things manageable. We can use `sample()` to draw a random sample of the data.

## Sample

### Notes

* `sample()`: Return a random sample of rows from a DataFrame, either a fixed number (`n`) or a fraction (`frac`).

### Examples

Let's get a random sample of the data. You can request a fixed number of rows with `n`.

In [11]:
df.sample(n=5)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
187522,Data Engineer,Data Engineer Junior - Python / SQL (H/F),"Paris, France",via Indeed,Full-time,False,France,2023-01-25 18:49:43,False,False,France,,,,pasteque.io,"['python', 'sql']","{'programming': ['python', 'sql']}"
344210,Data Analyst,Analyst Customer Data Management,"Makati, Metro Manila, Philippines",via Talentify,Full-time,False,Philippines,2023-09-01 08:33:33,False,False,Philippines,,,,Philip Morris International,"['sap', 'excel']","{'analyst_tools': ['sap', 'excel']}"
454597,Data Analyst,eStore Specialist(Data Analysis)_TC220077,"Bade District, Taoyuan City, Taiwan",via 104人力銀行,,False,Taiwan,2023-09-11 00:24:07,False,False,Taiwan,,,,"Super Micro Computer, Inc._美超微電腦股份有限公司",,
652282,Data Scientist,Data scientist (Deep learning) F/H,"Paris, France",via Cadremploi,Full-time,False,France,2023-03-11 12:29:55,False,False,France,,,,Malakoff Humanis,"['python', 'sql', 'r', 'tensorflow', 'keras', ...","{'libraries': ['tensorflow', 'keras', 'pytorch..."
85117,Senior Data Scientist,Senior Data Scientist,"Quezon City, Metro Manila, Philippines",via GrabJobs,Full-time,False,Philippines,2023-11-10 20:29:05,False,False,Philippines,,,,Avaloq,"['python', 'sql', 'sql server', 'power bi', 'g...","{'analyst_tools': ['power bi'], 'databases': [..."
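
Each call to `sample()` draws different rows, so the output above will change on every run. If you need the same rows each time, one option is the `random_state` argument. The sketch below assumes a toy DataFrame (`toy_df` and its `job_id` column are made up for illustration) and an arbitrary seed of 42.

```python
import pandas as pd

# Toy DataFrame standing in for the job postings (made-up data)
toy_df = pd.DataFrame({'job_id': range(100)})

# Passing the same random_state makes the draw repeatable
first_draw = toy_df.sample(n=5, random_state=42)
second_draw = toy_df.sample(n=5, random_state=42)

# Both draws contain identical rows in identical order
same_rows = first_draw.equals(second_draw)

print('Rows in sample: ', len(first_draw))
print('Draws identical?', same_rows)
```

The same argument works on the full dataset, for example `df.sample(n=5, random_state=42)`.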


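Or you can randomly select a fraction of the rows (e.g., 10%) with `frac`. The `replace` argument controls whether a row can be drawn more than once; `replace=False` (the default) samples without replacement.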

In [12]:
df.sample(frac=0.1, replace=False)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
31569,Data Engineer,Data Engineer,Spain,via BeBee,Full-time,False,Spain,2023-08-22 07:20:37,True,False,Spain,,,,VISEO - Spain,"['python', 'sql', 'sql server', 'aws', 'tablea...","{'analyst_tools': ['tableau'], 'cloud': ['aws'..."
234493,Software Engineer,Senior Python Engineer,"Colombia, Huila, Colombia",via BeBee,Full-time,False,Colombia,2023-12-05 17:26:46,True,False,Colombia,,,,Thaloz,"['python', 'dynamodb', 'cassandra', 'redis', '...","{'cloud': ['aws', 'snowflake'], 'databases': [..."
196628,Data Engineer,Data Engineer – R01520362,"Bengaluru, Karnataka, India",via Indeed,Full-time,False,India,2023-01-13 18:14:08,True,False,India,,,,Brillio,,
207860,Data Analyst,Business Intelligence Engineer,United Kingdom,via Ai-Jobs.net,Full-time,False,United Kingdom,2023-11-10 17:33:00,False,False,United Kingdom,,,,Aviva,"['sql', 'go', 'tableau', 'power bi']","{'analyst_tools': ['tableau', 'power bi'], 'pr..."
516812,Machine Learning Engineer,Machine Learning Engineer for Artificial Intel...,"Petah Tikva, Israel",via Karkidi,Full-time,False,Israel,2023-01-08 23:38:34,False,False,Israel,,,,Intel Corporation,"['sql', 'python', 'kafka', 'hadoop', 'airflow'...","{'libraries': ['kafka', 'hadoop', 'airflow'], ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
579629,Data Scientist,Sr Data Scientist,"Baltimore, MD",via BeBee,Full-time,False,Georgia,2023-08-18 10:43:55,False,False,United States,,,,Exelon,"['python', 'r', 'scala', 'sql', 'spark', 'hado...","{'libraries': ['spark', 'hadoop', 'pyspark'], ..."
340346,Data Scientist,Data scientist,Anywhere,via LinkedIn تونس,Full-time,True,Tunisia,2023-03-07 09:17:07,False,False,Tunisia,,,,TOP TECH,"['python', 'java', 'sql', 'cassandra', 'databr...","{'analyst_tools': ['datarobot', 'tableau'], 'c..."
239448,Data Engineer,Data Engineers,"Chennai, Tamil Nadu, India",via Indeed,Full-time,False,India,2023-06-24 17:17:44,True,False,India,,,,Fipsar,"['sql', 'aws', 'snowflake']","{'cloud': ['aws', 'snowflake'], 'programming':..."
769032,Business Analyst,Senior Business Analyst/Business Analyst,Hong Kong,via BeBee 香港,Full-time,False,Hong Kong,2023-11-28 02:07:31,False,False,Hong Kong,,,,Inavitas,"['mysql', 'oracle', 'hadoop', 'tableau', 'powe...","{'analyst_tools': ['tableau', 'power bi'], 'cl..."
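
To make the effect of `frac` and `replace` concrete, here is a minimal sketch on a toy DataFrame of 100 rows (`toy_df`, `job_id`, and the seed are assumptions for illustration only). Without replacement every sampled row is distinct; with replacement the same row can appear more than once, which also allows a `frac` greater than 1.

```python
import pandas as pd

# Toy DataFrame standing in for the job postings (made-up data)
toy_df = pd.DataFrame({'job_id': range(100)})

# 10% of the rows, each row drawn at most once
tenth = toy_df.sample(frac=0.1, replace=False, random_state=0)

# 200% of the rows is only possible when rows may repeat
oversample = toy_df.sample(frac=2.0, replace=True, random_state=0)

print('Rows without replacement:', len(tenth))
print('All rows distinct?       ', tenth.index.is_unique)
print('Rows with replacement:   ', len(oversample))
print('All rows distinct?       ', oversample.index.is_unique)
```

Drawing 200 rows from only 100 must repeat some of them, so the index of `oversample` cannot be unique.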


Now you can analyze these subsets of data. 