## Data analysis of employment in non-profit organizations

Summary of the process: <br />
1. Import the CSV file containing the dataset. <br />
2. Remove all rows with null (NA) values from the original dataset. <br />
3. Divide the data by the seven indicators. <br />
4. Measure the best-fitting distribution for each indicator. <br />
5. Drop the 2010-2012 data and split the rest into 2013-2015, 2016-2018, and 2019-2021. <br />
6. Divide into training (2013-2018) and testing (2019-2021) datasets. (Only the testing set will be used.) <br />
7. Divide by characteristic type (age, gender, education, and immigrant status). <br />
8. Drop all other characteristic types. <br />
9. Divide the dataset by province, select only five provinces, and merge them into the previously divided datasets.

Import all requirements:

In [3]:
import pandas as pd
import numpy as np
import ydata_profiling as pp  
from ydata_profiling import ProfileReport 
import warnings
import os

warnings.filterwarnings('ignore')

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt
from fitter import Fitter, get_common_distributions, get_distributions

from sklearn.linear_model import LinearRegression
from scipy.stats import chi2_contingency

import datetime as dt

<h3> Part 1 - Import CSV </h3>

Import the non-profit employment dataset.

In [5]:
df = pd.read_csv('36100651.csv')

print(df.info())
print(df.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105840 entries, 0 to 105839
Data columns (total 17 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   REF_DATE         105840 non-null  int64  
 1   GEO              105840 non-null  object 
 2   DGUID            105840 non-null  object 
 3   Sector           105840 non-null  object 
 4   Characteristics  105840 non-null  object 
 5   Indicators       105840 non-null  object 
 6   UOM              105840 non-null  object 
 7   UOM_ID           105840 non-null  int64  
 8   SCALAR_FACTOR    105840 non-null  object 
 9   SCALAR_ID        105840 non-null  int64  
 10  VECTOR           105840 non-null  object 
 11  COORDINATE       105840 non-null  object 
 12  VALUE            102816 non-null  float64
 13  STATUS           3024 non-null    object 
 14  SYMBOL           0 non-null       float64
 15  TERMINATED       0 non-null       float64
 16  DECIMALS         105840 non-null  int6

<h3> Part 2 - Filter NA </h3>

Keep only the essential columns of the original dataset.

In [6]:
print("Grab the only the essential part of database.")

# From the original, 
# UOM_ID, SCALAR_ID, VECTOR, COORDINATE, STATUS, SYMBOL, TERMINATED, and DECIMALS columns are removed.

df_sorted = df[['REF_DATE','DGUID','GEO','Sector','Characteristics','Indicators','UOM','SCALAR_FACTOR','VALUE']]

print(df_sorted.head(20))
print(df_sorted.info())

print("Sort by Characteristics")
grouped = df_sorted.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.size]))

print("Sort by Indicator")
grouped = df_sorted.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.size]))


Grab the only the essential part of database.
    REF_DATE           DGUID     GEO                         Sector  \
0       2010  2016A000011124  Canada  Total non-profit institutions   
1       2010  2016A000011124  Canada  Total non-profit institutions   
2       2010  2016A000011124  Canada  Total non-profit institutions   
3       2010  2016A000011124  Canada  Total non-profit institutions   
4       2010  2016A000011124  Canada  Total non-profit institutions   
5       2010  2016A000011124  Canada  Total non-profit institutions   
6       2010  2016A000011124  Canada  Total non-profit institutions   
7       2010  2016A000011124  Canada  Total non-profit institutions   
8       2010  2016A000011124  Canada  Total non-profit institutions   
9       2010  2016A000011124  Canada  Total non-profit institutions   
10      2010  2016A000011124  Canada  Total non-profit institutions   
11      2010  2016A000011124  Canada  Total non-profit institutions   
12      2010  2016A000011124  C

Check for missing values in the sorted dataset created above.
* Note that this dataset contains missing values.
* Based on the "VALUE" column, 2.86% of the records are missing.
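
The 2.86% figure follows directly from the counts reported by `df.info()` above: 105840 rows in total, 102816 of them with a non-null `VALUE`. Below is a minimal check of that arithmetic, together with a compact pandas idiom for the same per-column percentage. The frame `toy` is hypothetical and only stands in for the real dataset.

```python
import pandas as pd

# Counts taken from the df.info() output above.
total_rows = 105840
non_null_value = 102816

missing = total_rows - non_null_value
pct_missing = missing * 100 / total_rows
print(missing, round(pct_missing, 2))

# Same idea on a small hypothetical frame: isna().mean() gives the
# fraction of missing entries per column.
toy = pd.DataFrame({"VALUE": [1.0, None, 3.0, 4.0], "GEO": ["a", "b", "c", "d"]})
pct_by_column = toy.isna().mean() * 100
print(pct_by_column)
```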

In [7]:
# The "STATUS", "SYMBOL", and "TERMINATED" columns will be removed after this analysis.
# They do not contain meaningful data.

percent_missing_df = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'percent_in_na': percent_missing_df,
                                 'num_of_na': df.isnull().sum(),
                                 'total_sample': len(df)})
print("Original database null counter")
print(missing_value_df)

# Note that 2.86% of the data (VALUE) is missing.
# To keep the analysis straightforward, I decided to remove the rows with missing values.

percent_missing_df_sorted = df_sorted.isnull().sum() * 100 / len(df_sorted)
missing_value_df_sorted = pd.DataFrame({'percent_in_na': percent_missing_df_sorted,
                                 'num_of_na': df_sorted.isnull().sum(),
                                 'total_sample': len(df_sorted)})
print("\nModified dataset null counter.")
print(missing_value_df_sorted)

Original database null counter
                 percent_in_na  num_of_na  total_sample
REF_DATE              0.000000          0        105840
GEO                   0.000000          0        105840
DGUID                 0.000000          0        105840
Sector                0.000000          0        105840
Characteristics       0.000000          0        105840
Indicators            0.000000          0        105840
UOM                   0.000000          0        105840
UOM_ID                0.000000          0        105840
SCALAR_FACTOR         0.000000          0        105840
SCALAR_ID             0.000000          0        105840
VECTOR                0.000000          0        105840
COORDINATE            0.000000          0        105840
VALUE                 2.857143       3024        105840
STATUS               97.142857     102816        105840
SYMBOL              100.000000     105840        105840
TERMINATED          100.000000     105840        105840
DECIMALS         

Drop the rows with missing values from the sorted dataset.

In [8]:
df_sorted_na = df_sorted.dropna()
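
Note that `dropna()` with no arguments drops a row if any column is missing. In `df_sorted` only `VALUE` contains nulls, so the result should match restricting the check to that column. A small sketch on a hypothetical frame (`toy` is made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "REF_DATE": [2010, 2011, 2012, 2013],
    "VALUE": [10.0, np.nan, 30.0, np.nan],
})

dropped_any = toy.dropna()                    # drop rows with NaN in any column
dropped_value = toy.dropna(subset=["VALUE"])  # drop rows with NaN in VALUE only

# Both agree here because VALUE is the only column with missing entries.
print(dropped_any.equals(dropped_value))
print(len(toy) - len(dropped_any), "rows removed")
```

The original row labels are kept after dropping, which is why the `info()` output of the cleaned frame still reports an index running up to 105839.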

Now check whether any missing data remains in the modified sorted dataset created above.

In [9]:
print("Modified dataset modification after removing missing value and it's total counter")

percent_missing_df_sorted_na = df_sorted_na.isnull().sum() * 100 / len(df_sorted_na)
missing_value_df_sorted_na = pd.DataFrame({'percent_in_na': percent_missing_df_sorted_na})
print(missing_value_df_sorted_na)
# print(df_sorted_na.head(20))

print(df_sorted_na.info())
grouped = df_sorted_na.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.size]))

grouped = df_sorted_na.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.size]))

Modified dataset modification after removing missing value and it's total counter
                 percent_in_na
REF_DATE                   0.0
DGUID                      0.0
GEO                        0.0
Sector                     0.0
Characteristics            0.0
Indicators                 0.0
UOM                        0.0
SCALAR_FACTOR              0.0
VALUE                      0.0
<class 'pandas.core.frame.DataFrame'>
Index: 102816 entries, 0 to 105839
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   REF_DATE         102816 non-null  int64  
 1   DGUID            102816 non-null  object 
 2   GEO              102816 non-null  object 
 3   Sector           102816 non-null  object 
 4   Characteristics  102816 non-null  object 
 5   Indicators       102816 non-null  object 
 6   UOM              102816 non-null  object 
 7   SCALAR_FACTOR    102816 non-null  object 
 8   VALUE            102816 non-

<h3> Pandas profiling of the datasets </h3>

Pandas profiling for the original dataset (CSV file):

In [10]:
# https://medium.com/analytics-vidhya/pandas-profiling-5ecd0b977ecd

pp = ProfileReport(df, title="Pandas Profiling Report")
pp_df = pp.to_html()

# Export the report for the unmodified dataset to an HTML file.
# Mode "w" overwrites the file, so re-running the cell does not append a second copy.
with open("df_NoMod.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 31/31 [00:07<00:00,  4.11it/s, Completed]                   
Generate report structure: 100%|██████████| 1/1 [00:07<00:00,  7.79s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.43s/it]
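
When a rendered report is written to disk, the file mode matters: `"a"` appends to an existing file on every run, while `"w"` replaces it. The sketch below uses only the standard library and a hypothetical file name (`report_demo.html`) in a temporary directory to show the difference.

```python
import os
import tempfile

html = "<html><body>report</body></html>"  # stand-in for a rendered report

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report_demo.html")  # hypothetical file name

    # Append mode: running the export twice leaves two copies in one file.
    for _ in range(2):
        with open(path, "a", encoding="utf-8") as f:
            f.write(html)
    size_append = os.path.getsize(path)

    # Write mode: each run starts from an empty file.
    for _ in range(2):
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
    size_write = os.path.getsize(path)

print(size_append, size_write)
```

`ProfileReport` also provides a `to_file` method that writes the HTML report directly, which may be a simpler option for these export cells.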


Pandas profiling for the sorted dataset:

In [11]:
pp_sorted = ProfileReport(df_sorted, title="Pandas Profiling Report with Columns Sorted")
pp_df_sorted = pp_sorted.to_html()

# Export the report for the column-filtered dataset to an HTML file; "w" overwrites any earlier copy.
with open("df_Sorted.html", "w", encoding="utf-8") as f:
    f.write(pp_df_sorted)

Summarize dataset: 100%|██████████| 22/22 [00:05<00:00,  3.76it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.26s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.32it/s]


Pandas profiling for the modified sorted dataset (missing data removed):

In [12]:
pp = ProfileReport(df_sorted_na, title="Pandas Profiling Report with Columns Sorted and NA Removed")
pp_df_sorted = pp.to_html()

# Export the report for the cleaned dataset to an HTML file; "w" overwrites any earlier copy.
with open("df_Sorted-no-na.html", "w", encoding="utf-8") as f:
    f.write(pp_df_sorted)

Summarize dataset: 100%|██████████| 22/22 [00:04<00:00,  4.50it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.17s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.48it/s]


In [13]:
# The main difference is that there is less data to work with.
# In particular, business non-profit organizations and community organizations have more missing values.

<h3> This part of the code is used to export files that show the output of each split portion of the analysis. </h3>

The code below creates a directory to organize the files output by this script.

In [14]:
# A small class that creates an output folder, so that file-system access stays in one place.
# Usage:
# CreatedTheFile = toOrganizedOutputFiles('name')

# Adapted from https://www.geeksforgeeks.org/create-a-directory-in-python/
# (Example 1)

class toOrganizedOutputFiles: # Creating folder for each section
  def __init__(self, name):

    import os

    # Leaf directory  
    directory = name
        
    # Parent Directories  
    parent_dir = ""
        
    # Path  
    path = os.path.join(parent_dir, directory)  

    if os.path.isdir(path):
      print("Directory '% s' is ALREADY created" % directory)  
    else:
      # Create the directory  
      os.makedirs(path)  
      print("Directory '% s' created" % directory)  
    


To clarify, a new directory will store the result files based on the 'Indicators' column.

In [15]:
CreatedTheFile = toOrganizedOutputFiles('Result_By_Indicators')

Directory 'Result_By_Indicators' is ALREADY created
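
For comparison, the same "create it unless it already exists" behaviour can be sketched with `os.makedirs(..., exist_ok=True)`. The helper name `ensure_dir` is hypothetical, and the example runs inside a temporary directory so it does not touch the project folder.

```python
import os
import tempfile

def ensure_dir(name, parent_dir=""):
    """Create parent_dir/name if needed and return the path (hypothetical helper)."""
    path = os.path.join(parent_dir, name)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path

with tempfile.TemporaryDirectory() as tmp:
    first = ensure_dir("Result_By_Indicators", parent_dir=tmp)
    second = ensure_dir("Result_By_Indicators", parent_dir=tmp)  # safe to repeat
    created = os.path.isdir(first)

print(created, first == second)
```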


<h3> Part 3 - Divide datasets by 'Indicators' </h3>

Next, I filter the dataset by each of the 'Indicators' listed below. All of this is done on the modified sorted dataset (missing values removed).
* Note that there are seven indicators in the data.
* Note that the data will be divided into seven datasets, one per indicator.
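
The split into one dataset per indicator can also be sketched with a single `groupby`, which avoids writing one filter per indicator. The frame `toy` and the dictionary `by_indicator` below are hypothetical names for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_sorted_na with two indicators.
toy = pd.DataFrame({
    "Indicators": ["Number of jobs", "Hours worked", "Number of jobs", "Hours worked"],
    "VALUE": [11.0, 6.0, 25.0, 40.0],
})

# One DataFrame per indicator, keyed by the indicator name.
by_indicator = {name: group for name, group in toy.groupby("Indicators")}

for name, group in by_indicator.items():
    print(name, len(group), group["VALUE"].median())
```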

In [16]:
# All columns
print(df_sorted_na.info())

# All indicators
grouped = df_sorted_na.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.size]))

<class 'pandas.core.frame.DataFrame'>
Index: 102816 entries, 0 to 105839
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   REF_DATE         102816 non-null  int64  
 1   DGUID            102816 non-null  object 
 2   GEO              102816 non-null  object 
 3   Sector           102816 non-null  object 
 4   Characteristics  102816 non-null  object 
 5   Indicators       102816 non-null  object 
 6   UOM              102816 non-null  object 
 7   SCALAR_FACTOR    102816 non-null  object 
 8   VALUE            102816 non-null  float64
dtypes: float64(1), int64(1), object(7)
memory usage: 11.9+ MB
None
                                    size
Indicators                              
Average annual hours worked        14688
Average annual wages and salaries  14688
Average hourly wage                14688
Average weekly hours worked        14688
Hours worked                       14688
Number of jobs         

Average annual hours worked from the modified sorted dataset.

In [17]:
# Average annual hours worked: 15120 rows before dropping NA, 14688 after
print("\nAverage annual hours worked")
df_AvgAnnHrsWrk = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Average annual hours worked')
]
# grouped = df_AvgAnnHrsWrk.groupby(['GEO'])
grouped = df_AvgAnnHrsWrk.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print("The total number of this one is ",len(df_AvgAnnHrsWrk.index))

sns.displot(data=df_AvgAnnHrsWrk, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Average annual hours worked
                             median         mean         std   size
Indicators                                                         
Average annual hours worked  1593.0  1551.436138  252.784087  14688
The total number of this one is  14688
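
The same summary can be requested with string aggregation names instead of NumPy functions. Some recent pandas releases emit a `FutureWarning` when NumPy callables such as `np.mean` are passed to `agg`, so the string form may be the safer choice, depending on the pandas version in use. A sketch on a hypothetical frame (`toy` is not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Indicators": ["Average annual hours worked"] * 4,
    "VALUE": [1500.0, 1600.0, 1700.0, 1800.0],
})

# "size" counts rows per group; "std" is the sample standard deviation.
summary = toy.groupby("Indicators")["VALUE"].agg(["median", "mean", "std", "size"])
print(summary)
```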


Pandas profiling only for "Average annual hours worked"

In [18]:
pp = ProfileReport(df_AvgAnnHrsWrk, title="Average annual hours worked")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Average annual hours worked.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00,  9.61it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.08s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.30it/s]


Average annual wages and salaries from the modified sorted dataset (mentioned above).

In [19]:
# Average annual wages and salaries: 15120 rows before dropping NA, 14688 after
print("\nAverage annual wages and salaries")
df_AvgAnnWages = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Average annual wages and salaries')
]
grouped = df_AvgAnnWages.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print("The total number of this one is ",len(df_AvgAnnWages.index))

sns.displot(data=df_AvgAnnWages, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Average annual wages and salaries
                                    median          mean           std   size
Indicators                                                                   
Average annual wages and salaries  42186.5  43804.782748  16620.351087  14688
The total number of this one is  14688


Pandas profiling only for "Average annual wages and salaries"

In [20]:
pp = ProfileReport(df_AvgAnnWages, title="Average annual wages and salaries")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Average annual wages and salaries.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00, 10.24it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.54s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.20it/s]


Average hourly wage from the modified sorted dataset (mentioned above).

In [21]:
# Average hourly wage: 15120 rows before dropping NA, 14688 after
print("\nAverage hourly wage")
df_AvgHrsWages = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Average hourly wage')
]
grouped = df_AvgHrsWages.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print("The total number of this one is ",len(df_AvgHrsWages.index))

sns.displot(data=df_AvgHrsWages, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Average hourly wage
                     median       mean       std   size
Indicators                                             
Average hourly wage    26.7  27.825611  8.601721  14688
The total number of this one is  14688


Pandas profiling only for "Average hourly wage"

In [22]:
pp = ProfileReport(df_AvgHrsWages, title="Average hourly wage")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Average hourly wages.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00,  8.52it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.22s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.33it/s]


Average weekly hours worked from the modified sorted dataset.

In [23]:
# Average weekly hours worked: 15120 rows before dropping NA, 14688 after
print("\nAverage weekly hours worked")
df_AvgWeekHrsWrked = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Average weekly hours worked')
]
grouped = df_AvgWeekHrsWrked.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print("The total number of this one is ",len(df_AvgWeekHrsWrked.index))

sns.displot(data=df_AvgWeekHrsWrked, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Average weekly hours worked
                             median       mean      std   size
Indicators                                                    
Average weekly hours worked    31.0  29.831767  4.86689  14688
The total number of this one is  14688


Pandas profiling only for "Average weekly hours worked"

In [24]:
pp = ProfileReport(df_AvgWeekHrsWrked, title="Average weekly hours worked")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Average weekly hours worked.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00,  7.49it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.36s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.19it/s]


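Hours worked from the modified sorted dataset.
* Note: the distribution is right-skewed (most values sit at the low end, with a long tail toward large values).

In [25]:
# Hours worked: 15120 rows before dropping NA, 14688 after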
print("\nHours worked")
df_Hrs_Wrked = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Hours worked')
]
grouped = df_Hrs_Wrked.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print(grouped['VALUE'].agg([np.amin, np.amax]))
print("The total number of this one is ",len(df_Hrs_Wrked.index))

sns.displot(data=df_Hrs_Wrked, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Hours worked
              median          mean            std   size
Indicators                                              
Hours worked  9586.5  83596.946283  253684.449101  14688
              amin       amax
Indicators                   
Hours worked   6.0  3857813.0
The total number of this one is  14688
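
A quick numeric check of skew direction is to compare the mean with the median, or to compute the sample skewness. In the output above the mean (about 83,597) is far larger than the median (9,586.5), which indicates a long tail toward large values, that is, positive (right) skew. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values: mostly small, with one very large observation.
values = pd.Series([6.0, 8.0, 10.0, 12.0, 500.0])

mean, median = values.mean(), values.median()
skewness = values.skew()  # positive -> tail to the right, negative -> tail to the left

print(mean, median, skewness)
```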


Pandas profiling only for "Hours worked" (right-skewed, as noted)

In [26]:
pp = ProfileReport(df_Hrs_Wrked, title="Hours Worked")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Hours worked.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00,  9.49it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:05<00:00,  5.18s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.06it/s]


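Number of jobs from the modified sorted dataset.
* Note: the distribution is right-skewed (most values sit at the low end, with a long tail toward large values).

In [27]:
# Number of jobs: 15120 rows before dropping NA, 14688 after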
print("\nNumber of jobs")
df_NumOfJob = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Number of jobs')
]
grouped = df_NumOfJob.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print(grouped['VALUE'].agg([np.amin, np.amax]))
print("The total number of this one is ",len(df_NumOfJob.index))

sns.displot(data=df_NumOfJob, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Number of jobs
                median          mean            std   size
Indicators                                                
Number of jobs  6305.5  53441.062432  161120.806948  14688
                amin       amax
Indicators                     
Number of jobs  11.0  2428289.0
The total number of this one is  14688


Pandas profiling only for "Number of jobs" (right-skewed)

In [28]:
pp = ProfileReport(df_NumOfJob, title="Number of jobs")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Number of jobs.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00,  7.98it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.78s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.34it/s]


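Wages and salaries from the modified sorted dataset.

* Note: the distribution is right-skewed (most values sit at the low end, with a long tail toward large values).

In [29]:
# Wages and salaries: 15120 rows before dropping NA, 14688 after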
print("\nWages and salaries")
df_WagesAndSalaries = df_sorted_na.loc[
    (df_sorted_na['Indicators'] == 'Wages and salaries')
]
grouped = df_WagesAndSalaries.groupby(['Indicators'])
print(grouped['VALUE'].agg([np.median, np.mean, np.std, np.size]))
print(grouped['VALUE'].agg([np.amin, np.amax]))
print("The total number of this one is ",len(df_WagesAndSalaries.index))

sns.displot(data=df_WagesAndSalaries, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
plt.show()


Wages and salaries
                    median         mean          std   size
Indicators                                                 
Wages and salaries   224.0  2484.918233  7977.441388  14688
                    amin      amax
Indicators                        
Wages and salaries   0.0  132601.0
The total number of this one is  14688


Pandas profiling only for "Wages and salaries" (right-skewed)

In [30]:
pp = ProfileReport(df_WagesAndSalaries, title="Wages and Salaries")
pp_df = pp.to_html()

# Export the report for this indicator to an HTML file; "w" overwrites any earlier copy.
with open("Result_By_Indicators/Wages and salaries.html", "w", encoding="utf-8") as f:
    f.write(pp_df)

Summarize dataset: 100%|██████████| 22/22 [00:02<00:00,  8.73it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.71s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.33it/s]


<h3> Part 4 - Best fit for each 'Indicators' </h3>

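Next, I use the "Fitter" library to find the best-fitting distribution for each indicator's values. Each indicator is fitted against five candidates (gamma, lognormal, beta, Burr, and normal). The values can be matched to one of these distributions, but none of the indicators is normally distributed.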
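
To make the selection criterion concrete: as far as I understand it, `get_best(method='sumsquare_error')` ranks candidates by the sum of squared differences between the data's histogram density and each fitted probability density. The sketch below illustrates that idea with NumPy only. It is a simplified stand-in, not Fitter's implementation: the data are synthetic, only two candidates are compared, and the parameters are estimated from simple moments rather than by maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.5, size=20000)  # synthetic, right-skewed

# Histogram as an empirical density, evaluated at the bin centres.
density, edges = np.histogram(data, bins=100, density=True)
centres = (edges[:-1] + edges[1:]) / 2

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def lognormal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((np.log(x) - mu) / sigma) ** 2) / (x * sigma * np.sqrt(2 * np.pi))

# Parameter estimates from sample moments (of x and of log x).
candidates = {
    "norm": normal_pdf(centres, data.mean(), data.std()),
    "lognorm": lognormal_pdf(centres, np.log(data).mean(), np.log(data).std()),
}

# Sum of squared errors between the histogram and each fitted density.
sse = {name: float(np.sum((pdf - density) ** 2)) for name, pdf in candidates.items()}
best = min(sse, key=sse.get)
print(sse, best)
```

A lower error means the fitted curve follows the histogram more closely; it does not by itself show that the data were generated by that distribution.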

Best distribution for "Average annual hours worked"

In [31]:
# Not normally distributed; left-skewed (the mean, 1551.4, is below the median, 1593.0)

# https://medium.com/the-researchers-guide/finding-the-best-distribution-that-fits-your-data-using-pythons-fitter-library-319a5a0972e9
# https://www.datacamp.com/tutorial/probability-distributions-python
# https://realpython.com/python-histograms/
# https://seaborn.pydata.org/tutorial/introduction.html
# https://aminazahid45.medium.com/seaborn-in-python-76f44752a7c8
# https://stackoverflow.com/questions/26597116/seaborn-plots-not-showing-up
# https://www.analyticsvidhya.com/blog/2021/02/statistics-101-beginners-guide-to-continuous-probability-distributions/

fa = Fitter(df_AvgAnnHrsWrk["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'beta': {'a': 638.3317504246395,
  'b': 31.78242716784456,
  'loc': -26340.79799985259,
  'scale': 29280.128191604977}}

Best distribution for "Average annual wages and salaries"

In [32]:
fa = Fitter(df_AvgAnnWages["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'gamma': {'a': 10.317012968188816,
  'loc': -8874.841101131682,
  'scale': 5106.09266888318}}

Best distribution for "Average hourly wage"

In [33]:
fa = Fitter(df_AvgHrsWages["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'lognorm': {'s': 0.32868880665290323,
  'loc': 2.5266200749710075,
  'scale': 23.96526012956064}}

Best distribution for "Average weekly hours worked"

In [34]:
fa = Fitter(df_AvgWeekHrsWrked["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'burr': {'c': 24.425610086588648,
  'd': 0.2639474187210325,
  'loc': -0.19516675888992813,
  'scale': 33.9923544234202}}

Best distribution for "Hours Worked"

In [35]:
fa = Fitter(df_Hrs_Wrked["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'lognorm': {'s': 2.500328743903225,
  'loc': 5.890210931623885,
  'scale': 8052.849839197419}}

Best distribution for "Number of jobs"

In [36]:
fa = Fitter(df_NumOfJob["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'beta': {'a': 0.44093091654590877,
  'b': 478.35910204629215,
  'loc': 10.999999999999998,
  'scale': 18916817.838308237}}

Best distribution for "Wages and Salaries"

In [37]:
fa = Fitter(df_WagesAndSalaries["VALUE"].values,
           distributions=['gamma',
                          'lognorm',
                          "beta",
                          "burr",
                          "norm"])
fa.fit()
fa.summary()
fa.get_best(method = 'sumsquare_error')

{'beta': {'a': 0.45806730861492995,
  'b': 504.0982556396307,
  'loc': -3.462059833380747e-23,
  'scale': 927559.1543601784}}

Before going further in the exercise, I export the work finished so far. This CSV file can be used to reproduce these parts of the analysis.

In [38]:
# Save the dataframe to a CSV file
df_sorted_na.to_csv('Result_By_Indicators/df_byIndicator.csv', index=False)
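
As a sketch of how the exported file could be picked up again later, the round trip below writes a small hypothetical frame with `index=False` and reads it back. The file name `df_byIndicator_demo.csv` and the frame `toy` are made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "REF_DATE": [2010, 2011],
    "Indicators": ["Number of jobs", "Hours worked"],
    "VALUE": [11.0, 6.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_byIndicator_demo.csv")  # hypothetical file name
    toy.to_csv(path, index=False)   # index=False keeps the row index out of the file
    restored = pd.read_csv(path)

print(list(restored.columns))
print(restored["VALUE"].tolist())
```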


<h3> Cohort analysis (commented out) </h3>

<b> The cohort analysis was performed in both Excel and Python. <br />
However, the output did not show the result I expected. <br />
Therefore, the Python code is commented out. The Excel work is posted on GitHub. </b>

Cohort analysis before modifying the whole dataset.<br />
The Excel work is done in separate files.

In [39]:
# # Excel work is done via this Youtube.
# # https://www.youtube.com/watch?v=dEhlop5ekYM&t=11s

# data_galaxy = df_AvgAnnHrsWrk.copy()
# data_galaxy['REF_DATE'] = data_galaxy['REF_DATE'].astype(str)
# # data_galaxy['REF_DATE'] = pd.to_datetime(data_galaxy["REF_DATE"])

# data_galaxy['FAKE_DATE'] = df.loc[:, 'REF_DATE']
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].astype(str)

# data_galaxy = data_galaxy.sort_values('FAKE_DATE')

# print(data_galaxy['FAKE_DATE'].unique())
# print(data_galaxy.info())

# # Create fake date as following.
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2010', '01-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2011', '02-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2012', '03-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2013', '04-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2014', '05-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2015', '06-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2016', '07-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2017', '08-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2018', '09-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2019', '10-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2020', '11-01-2010')
# data_galaxy['FAKE_DATE'] = data_galaxy['FAKE_DATE'].str.replace('2021', '12-01-2010')

# print(data_galaxy.info())

# data_galaxy['REF_DATE'] = pd.to_datetime(data_galaxy["REF_DATE"])
# data_galaxy['FAKE_DATE'] = pd.to_datetime(data_galaxy["FAKE_DATE"])

# # Export into CSV file, before Cohort Analysis begins.
# # data_galaxy.to_csv('df_AvgAnnHrsWrk_test.csv', index=False)

In [40]:
# # https://saturncloud.io/blog/converting-a-column-to-date-format-in-pandas-dataframe/
# # https://sparkbyexamples.com/pandas/pandas-extract-year-from-datetime/
# # https://saturncloud.io/blog/converting-a-column-to-date-format-in-pandas-dataframe/

# # https://www.askpython.com/python/examples/cohort-analysis
# # https://www.activestate.com/blog/cohort-analysis-with-python/


# # Make a copy
# data_galaxy = df_AvgAnnHrsWrk .copy()

        
# data_galaxy['REF_DATE'] = data_galaxy['REF_DATE'].astype(str)
# data_galaxy['REF_DATE'] = pd.to_datetime(data_galaxy["REF_DATE"])

# data_galaxy = data_galaxy.sort_values('REF_DATE')

# data_galaxy = data_galaxy.sort_values('FAKE_DATE')

# def getting_months(m):
#     return dt.datetime(m.year, m.month,1)

# #function for data to create a series
# def get_date_elements(df, column):
#     day = df[column].dt.day
#     month = df[column].dt.month
#     year = df[column].dt.year
#     return day, month, year 

# # using the above function
# data_galaxy['Invoice-Month'] = data_galaxy['FAKE_DATE'].apply(getting_months) # data_galaxy['REF_DATE'].apply(getting_months)
# # self.data_galaxy['Invoice-Month'] = self.data_galaxy['Invoice-Month']

# # indexing a column for the first month visit of the customer
# data_galaxy['Cohort-Month'] = data_galaxy.groupby('DGUID')['Invoice-Month'].transform('min')
# # data_galaxy['Cohort-Month'] = data_galaxy['Cohort-Month'].dt.to_period('Y')
# data_galaxy.head(30)

# # getting date for columns and invoice
# _,Invoiceofmonth,Invoiceofyear = get_date_elements(data_galaxy,'Invoice-Month')
# _,Cohortofmonth,Cohortofyear =  get_date_elements(data_galaxy,'Cohort-Month')

# # cohort index creation
# yeardifference = Invoiceofyear -Cohortofyear
# monthdifference = Invoiceofmonth - Cohortofmonth
# data_galaxy['Cohort-Index'] = yeardifference*12 +monthdifference+1

# # counting customer ID 
# cohort_data = data_galaxy.groupby(['Cohort-Month','Cohort-Index'])['DGUID'].apply(pd.Series.nunique).reset_index()
        
# # pivot table creation
# cohort_table = cohort_data.pivot(index='Cohort-Month', columns=['Cohort-Index'],values='DGUID')
        
# # changing index of the cohort table
# cohort_table.index = cohort_table.index.strftime('%Y') # ('%B %Y')
        
# # creation of heatmap and visualization
# plt.figure(figsize=(21,10))
# sns.heatmap(cohort_table,annot=True,cmap='Greens')
        
# # cohort for percentage analysis
# new_cohort_table = cohort_table.divide(cohort_table.iloc[:,0],axis=0)
        
# # creating a percentage visualization
# plt.figure(figsize=(21,10))
# colormap=sns.color_palette("mako", as_cmap=True)
# sns.heatmap(new_cohort_table,annot=True,fmt='.0%',cmap=colormap)
# # display the heatmaps.
# plt.show()

<h3> Part 5 - Division by years </h3>
The years are grouped into four three-year periods: (2010-2012), (2013-2015), (2016-2018), and (2019-2021).

In this step, I divide each indicator's dataset into four period datasets.<br />
They are 2010-2012 (dropped), 2013-2015, 2016-2018, and 2019-2021.<br />
I have also prepared the 2010-2012 dataset. However, it will not be used after this section, because it would add too much analysis.<br />
The cell below also demonstrates why.

In [41]:
print("There are seven Indicators to analysis,")
grouped = df_sorted_na.groupby('Indicators')
print(grouped['VALUE'].agg([np.size]))

print("\nThe data inside between 2010-2013, there's are # number of data and I will be repeating this seven more time.,")
df_Avg_Sample = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2010) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2011) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2012)
]

grouped = df_Avg_Sample.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.size]))

print("\nTo data inside above 2013 and split into three datasets, I need to repeat this analysis for "+str(7*3)+" (7x3) times.")
print("\nThis is also total of spliting into "+str(7*3)+" datasets.")

df_Avg_Sample_2013 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2013) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2014) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2015)
]

df_Avg_Sample_2016 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2016) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2017) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2018)
]

df_Avg_Sample_2019 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2019) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2020) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2021)
]

grouped = df_Avg_Sample_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.size]))

grouped = df_Avg_Sample_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.size]))

grouped = df_Avg_Sample_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.size]))

There are seven Indicators to analysis,
                                    size
Indicators                              
Average annual hours worked        14688
Average annual wages and salaries  14688
Average hourly wage                14688
Average weekly hours worked        14688
Hours worked                       14688
Number of jobs                     14688
Wages and salaries                 14688

The data inside between 2010-2013, there's are # number of data and I will be repeating this seven more time.,
          size
REF_DATE      
2010      1224
2011      1224
2012      1224

To data inside above 2013 and split into three datasets, I need to repeat this analysis for 21 (7x3) times.

This is also total of spliting into 21 datasets.
          size
REF_DATE      
2013      1224
2014      1224
2015      1224
          size
REF_DATE      
2016      1224
2017      1224
2018      1224
          size
REF_DATE      
2019      1224
2020      1224
2021      1224
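
The three-year filters above are written out by hand for each period. A more compact alternative, sketched here under the assumption that `REF_DATE` is an integer year column (as `df.info()` reports), is to map period labels to year ranges and filter with `Series.between`. The helper name `split_by_period`, the `PERIODS` mapping, and the toy frame are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Hypothetical period boundaries, inclusive on both ends.
PERIODS = {
    "2010": (2010, 2012),
    "2013": (2013, 2015),
    "2016": (2016, 2018),
    "2019": (2019, 2021),
}


def split_by_period(df, periods=PERIODS, year_col="REF_DATE"):
    """Return {label: sub-frame} with one entry per three-year period."""
    return {
        label: df.loc[df[year_col].between(start, end)]
        for label, (start, end) in periods.items()
    }


# Toy stand-in for one indicator frame such as df_AvgAnnHrsWrk.
toy = pd.DataFrame({
    "REF_DATE": list(range(2010, 2022)),
    "VALUE": [float(i) for i in range(12)],
})

parts = split_by_period(toy)
for label, part in parts.items():
    print(label, sorted(part["REF_DATE"].unique()))
```

Calling the helper once per indicator frame would give the same period datasets as the hand-written filters, provided the year ranges match.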


Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "Average annual hours worked".

In [42]:
# 2010-2012
df_AvgAnnHrsWrk_2010 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2010) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2011) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2012)
]

grouped = df_AvgAnnHrsWrk_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                sum  size
REF_DATE                 
2010      1905000.0  1224
2011      1909823.0  1224
2012      1915650.0  1224


In [43]:
print("Grabbing the data from 2013, 2016, and 2019.")

# 2013 - 2015
df_AvgAnnHrsWrk_2013 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2013) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2014) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2015)
]

# 2016 - 2018
df_AvgAnnHrsWrk_2016 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2016) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2017) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2018)
]

# 2019- 2021
df_AvgAnnHrsWrk_2019 = df_AvgAnnHrsWrk.loc[
    (df_AvgAnnHrsWrk['REF_DATE'] == 2019) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2020) |
    (df_AvgAnnHrsWrk['REF_DATE'] == 2021)
]

grouped = df_AvgAnnHrsWrk_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgAnnHrsWrk_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgAnnHrsWrk_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2013, 2016, and 2019.
                sum  size
REF_DATE                 
2013      1906851.0  1224
2014      1898034.0  1224
2015      1907286.0  1224
                sum  size
REF_DATE                 
2016      1899738.0  1224
2017      1881389.0  1224
2018      1894227.0  1224
                sum  size
REF_DATE                 
2019      1894126.0  1224
2020      1873544.0  1224
2021      1901826.0  1224
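
The per-year checks above pass NumPy functions (`np.sum`, `np.size`) to `agg`. Newer pandas versions may warn about this style and suggest string names instead. The sketch below uses a made-up toy frame to show the string form, and also shows how `"size"` (all rows) differs from `"count"` (non-null rows only). In this notebook the two should agree, since null values were removed earlier.

```python
import pandas as pd

# Toy data with one missing VALUE to make the size/count difference visible.
toy = pd.DataFrame({
    "REF_DATE": [2013, 2013, 2014, 2014, 2015],
    "VALUE": [10.0, 20.0, 5.0, None, 7.0],
})

# String aggregation names instead of np.sum / np.size.
summary = toy.groupby("REF_DATE")["VALUE"].agg(["sum", "size", "count"])
print(summary)
```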


Pandas profiling for years 2013, 2016, and 2019 for "Average annual hours worked".

In [44]:
# # 2013 - 2015
# pp = ProfileReport(df_AvgAnnHrsWrk_2013, title="Average annual hours worked 2013")
# pp_df = pp.to_html()

# f = open("Average annual hours worked 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# # 2016 - 2018
# pp = ProfileReport(df_AvgAnnHrsWrk_2016, title="Average annual hours worked 2016")
# pp_df = pp.to_html()

# f = open("Average annual hours worked 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# # 2019 - 2021
# pp = ProfileReport(df_AvgAnnHrsWrk_2019, title="Average annual hours worked 2019")
# pp_df = pp.to_html()

# f = open("Average annual hours worked 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()
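
The commented-out export cells open the HTML file in append mode (`"a"`), so running a cell twice would leave two reports concatenated in one file. A minimal sketch of a helper that overwrites instead is shown below. It assumes only that the report object exposes `to_html()`, as the `ProfileReport` calls in the notebook do. `save_report_html` and `FakeReport` are hypothetical names, used here so the example runs without the profiling library.

```python
from pathlib import Path
import tempfile


def save_report_html(report, path):
    """Write report.to_html() to path, replacing any existing file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(report.to_html(), encoding="utf-8")
    return path


class FakeReport:
    """Stand-in for a profiling report; only to_html() is needed."""

    def __init__(self, title):
        self.title = title

    def to_html(self):
        return f"<html><head><title>{self.title}</title></head></html>"


out_dir = Path(tempfile.mkdtemp())
target = out_dir / "Average annual hours worked 2013.html"
save_report_html(FakeReport("Average annual hours worked 2013"), target)
save_report_html(FakeReport("Average annual hours worked 2013"), target)  # simulated rerun

html = target.read_text(encoding="utf-8")
print(html.count("<html>"))  # one document, not two
```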

Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "Average annual wages and salaries".

In [45]:
# 2010 - 2012
df_AvgAnnWages_2010 = df_AvgAnnWages.loc[
    (df_AvgAnnWages['REF_DATE'] == 2010) |
    (df_AvgAnnWages['REF_DATE'] == 2011) |
    (df_AvgAnnWages['REF_DATE'] == 2012)
]

grouped = df_AvgAnnWages_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                 sum  size
REF_DATE                  
2010      48093832.0  1224
2011      49165720.0  1224
2012      50085622.0  1224


In [46]:
print("Grabbing the data from 2017, 2019, and 2021.")

# 2013 - 2015
df_AvgAnnWages_2013 = df_AvgAnnWages.loc[
    (df_AvgAnnWages['REF_DATE'] == 2013) |
    (df_AvgAnnWages['REF_DATE'] == 2014) |
    (df_AvgAnnWages['REF_DATE'] == 2015)
]

# 2016 - 2018
df_AvgAnnWages_2016 = df_AvgAnnWages.loc[
    (df_AvgAnnWages['REF_DATE'] == 2016) |
    (df_AvgAnnWages['REF_DATE'] == 2017) |
    (df_AvgAnnWages['REF_DATE'] == 2018)
]

# 2019 - 2021
df_AvgAnnWages_2019 = df_AvgAnnWages.loc[
    (df_AvgAnnWages['REF_DATE'] == 2019) |
    (df_AvgAnnWages['REF_DATE'] == 2020) |
    (df_AvgAnnWages['REF_DATE'] == 2021)
]

grouped = df_AvgAnnWages_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgAnnWages_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgAnnWages_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2017, 2019, and 2021.
                 sum  size
REF_DATE                  
2013      50598135.0  1224
2014      51805889.0  1224
2015      52715143.0  1224
                 sum  size
REF_DATE                  
2016      53166285.0  1224
2017      53965359.0  1224
2018      55525920.0  1224
                 sum  size
REF_DATE                  
2019      56997121.0  1224
2020      60597775.0  1224
2021      60687848.0  1224


Pandas profiling for years 2013, 2016, and 2019 for "Average annual wages and salaries".

In [47]:
# pp = ProfileReport(df_AvgAnnWages_2013, title="Average annual wages and salaries 2013")
# pp_df = pp.to_html()

# f = open("Average annual wages and salaries 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_AvgAnnWages_2016, title="Average annual wages and salaries 2016")
# pp_df = pp.to_html()

# f = open("Average annual wages and salaries 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_AvgAnnWages_2019, title="Average annual wages and salaries 2019")
# pp_df = pp.to_html()

# f = open("Average annual wages and salaries 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "Average hourly wage".

In [48]:
# 2010 - 2012
df_AvgHrsWages_2010 = df_AvgHrsWages.loc[
    (df_AvgHrsWages['REF_DATE'] == 2010) |
    (df_AvgHrsWages['REF_DATE'] == 2011) |
    (df_AvgHrsWages['REF_DATE'] == 2012)
]

grouped = df_AvgHrsWages_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

               sum  size
REF_DATE                
2010      30453.20  1224
2011      31053.73  1224
2012      31511.37  1224


In [49]:
print("Grabbing the data from 2013, 2016, and 2019.")

# 2013 - 2015
df_AvgHrsWages_2013 = df_AvgHrsWages.loc[
    (df_AvgHrsWages['REF_DATE'] == 2013) |
    (df_AvgHrsWages['REF_DATE'] == 2014) |
    (df_AvgHrsWages['REF_DATE'] == 2015)
]

# 2016 - 2018
df_AvgHrsWages_2016 = df_AvgHrsWages.loc[
    (df_AvgHrsWages['REF_DATE'] == 2016) |
    (df_AvgHrsWages['REF_DATE'] == 2017) |
    (df_AvgHrsWages['REF_DATE'] == 2018)
]

# 2019 - 2021
df_AvgHrsWages_2019 = df_AvgHrsWages.loc[
    (df_AvgHrsWages['REF_DATE'] == 2019) |
    (df_AvgHrsWages['REF_DATE'] == 2020) |
    (df_AvgHrsWages['REF_DATE'] == 2021)
]

grouped = df_AvgHrsWages_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgHrsWages_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgHrsWages_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2013, 2016, and 2019.
               sum  size
REF_DATE                
2013      31932.19  1224
2014      32916.44  1224
2015      33362.79  1224
               sum  size
REF_DATE                
2016      33756.84  1224
2017      34571.41  1224
2018      35356.16  1224
               sum  size
REF_DATE                
2019      36293.34  1224
2020      38980.38  1224
2021      38514.73  1224


Pandas profiling for years 2013, 2016, and 2019 for "Average hourly wage".

In [50]:
# pp = ProfileReport(df_AvgHrsWages_2013, title="Average hourly wage 2013")
# pp_df = pp.to_html()

# f = open("Average hourly wages 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_AvgHrsWages_2016, title="Average hourly wage 2016")
# pp_df = pp.to_html()

# f = open("Average hourly wages 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_AvgHrsWages_2019, title="Average hourly wage 2019")
# pp_df = pp.to_html()

# f = open("Average hourly wages 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "Average weekly hours worked".

In [51]:
# 2010 - 2012
df_AvgWeekHrsWrked_2010 = df_AvgWeekHrsWrked.loc[
    (df_AvgWeekHrsWrked['REF_DATE'] == 2010) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2011) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2012)
]

grouped = df_AvgWeekHrsWrked_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

              sum  size
REF_DATE               
2010      36645.0  1224
2011      36732.0  1224
2012      36842.0  1224


In [52]:
print("Grabbing the data from 2013, 2016, and 2019.")

# 2013 - 2015
df_AvgWeekHrsWrked_2013 = df_AvgWeekHrsWrked.loc[
    (df_AvgWeekHrsWrked['REF_DATE'] == 2013) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2014) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2015)
]

# 2016 - 2018
df_AvgWeekHrsWrked_2016 = df_AvgWeekHrsWrked.loc[
    (df_AvgWeekHrsWrked['REF_DATE'] == 2016) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2017) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2018)
]

# 2019 - 2021
df_AvgWeekHrsWrked_2019 = df_AvgWeekHrsWrked.loc[
    (df_AvgWeekHrsWrked['REF_DATE'] == 2019) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2020) |
    (df_AvgWeekHrsWrked['REF_DATE'] == 2021)
]

grouped = df_AvgWeekHrsWrked_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgWeekHrsWrked_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_AvgWeekHrsWrked_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2013, 2016, and 2019.
              sum  size
REF_DATE               
2013      36671.0  1224
2014      36483.0  1224
2015      36678.0  1224
              sum  size
REF_DATE               
2016      36541.0  1224
2017      36170.0  1224
2018      36416.0  1224
              sum  size
REF_DATE               
2019      36400.0  1224
2020      36036.0  1224
2021      36555.0  1224


Pandas profiling for years 2013, 2016, and 2019 for "Average weekly hours worked".

In [53]:
# pp = ProfileReport(df_AvgWeekHrsWrked_2013, title="Average weekly hours worked 2013")
# pp_df = pp.to_html()

# f = open("Average weekly hours worked 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_AvgWeekHrsWrked_2016, title="Average weekly hours worked 2016")
# pp_df = pp.to_html()

# f = open("Average weekly hours worked 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_AvgWeekHrsWrked_2019, title="Average weekly hours worked 2019")
# pp_df = pp.to_html()

# f = open("Average weekly hours worked 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "hours worked".

In [54]:
# 2010 - 2012
df_Hrs_Wrked_2010 = df_Hrs_Wrked.loc[
    (df_Hrs_Wrked['REF_DATE'] == 2010) |
    (df_Hrs_Wrked['REF_DATE'] == 2011) |
    (df_Hrs_Wrked['REF_DATE'] == 2012)
]

grouped = df_Hrs_Wrked_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                 sum  size
REF_DATE                  
2010      94217881.0  1224
2011      95953248.0  1224
2012      97724232.0  1224


In [55]:
print("Grabbing the data from 2013, 2016, and 2019.")

# 2013 - 2015
df_Hrs_Wrked_2013 = df_Hrs_Wrked.loc[
    (df_Hrs_Wrked['REF_DATE'] == 2013) |
    (df_Hrs_Wrked['REF_DATE'] == 2014) |
    (df_Hrs_Wrked['REF_DATE'] == 2015)
]

# 2016 - 2018
df_Hrs_Wrked_2016 = df_Hrs_Wrked.loc[
    (df_Hrs_Wrked['REF_DATE'] == 2016) |
    (df_Hrs_Wrked['REF_DATE'] == 2017) |
    (df_Hrs_Wrked['REF_DATE'] == 2018)
]

# 2019 - 2021
df_Hrs_Wrked_2019 = df_Hrs_Wrked.loc[
    (df_Hrs_Wrked['REF_DATE'] == 2019) |
    (df_Hrs_Wrked['REF_DATE'] == 2020) |
    (df_Hrs_Wrked['REF_DATE'] == 2021)
]

grouped = df_Hrs_Wrked_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_Hrs_Wrked_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_Hrs_Wrked_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2013, 2016, and 2019.
                  sum  size
REF_DATE                   
2013       98935086.0  1224
2014       99777902.0  1224
2015      101927894.0  1224
                  sum  size
REF_DATE                   
2016      103980992.0  1224
2017      103906357.0  1224
2018      106885614.0  1224
                  sum  size
REF_DATE                   
2019      108384508.0  1224
2020      103897606.0  1224
2021      112280627.0  1224


Pandas profiling for years 2013, 2016, and 2019 for "Hours worked".

In [56]:
# pp = ProfileReport(df_Hrs_Wrked_2013, title="Hours Worked 2013")
# pp_df = pp.to_html()

# f = open("Hours worked 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_Hrs_Wrked_2016, title="Hours Worked 2016")
# pp_df = pp.to_html()

# f = open("Hours worked 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_Hrs_Wrked_2019, title="Hours Worked 2019")
# pp_df = pp.to_html()

# f = open("Hours worked 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "Number of jobs".

In [57]:
# 2010 - 2012
df_NumOfJob_2010 = df_NumOfJob.loc[
    (df_NumOfJob['REF_DATE'] == 2010) |
    (df_NumOfJob['REF_DATE'] == 2011) |
    (df_NumOfJob['REF_DATE'] == 2012)
]

grouped = df_NumOfJob_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                 sum  size
REF_DATE                  
2010      59793067.0  1224
2011      61122016.0  1224
2012      61949430.0  1224


In [58]:
print("Grabbing the data from 2013, 2016, and 2019.")

# 2013 - 2015
df_NumOfJob_2013 = df_NumOfJob.loc[
    (df_NumOfJob['REF_DATE'] == 2013) |
    (df_NumOfJob['REF_DATE'] == 2014) |
    (df_NumOfJob['REF_DATE'] == 2015)
]

# 2016 - 2018
df_NumOfJob_2016 = df_NumOfJob.loc[
    (df_NumOfJob['REF_DATE'] == 2016) |
    (df_NumOfJob['REF_DATE'] == 2017) |
    (df_NumOfJob['REF_DATE'] == 2018)
]

# 2019 - 2021
df_NumOfJob_2019 = df_NumOfJob.loc[
    (df_NumOfJob['REF_DATE'] == 2019) |
    (df_NumOfJob['REF_DATE'] == 2020) |
    (df_NumOfJob['REF_DATE'] == 2021)
]

grouped = df_NumOfJob_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_NumOfJob_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_NumOfJob_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2013, 2016, and 2019.
                 sum  size
REF_DATE                  
2013      63161481.0  1224
2014      63980381.0  1224
2015      65239031.0  1224
                 sum  size
REF_DATE                  
2016      66578386.0  1224
2017      66979678.0  1224
2018      68215341.0  1224
                 sum  size
REF_DATE                  
2019      69702012.0  1224
2020      67249987.0  1224
2021      70971515.0  1224


Pandas profiling for years 2013, 2016, and 2019 for "Number of jobs".

In [59]:
# pp = ProfileReport(df_NumOfJob_2013, title="Number of jobs 2013")
# pp_df = pp.to_html()

# f = open("Number of jobs 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_NumOfJob_2016, title="Number of jobs 2016")
# pp_df = pp.to_html()

# f = open("Number of jobs 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_NumOfJob_2019, title="Number of jobs 2019")
# pp_df = pp.to_html()

# f = open("Number of jobs 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

Grabbing the year (REF_DATE) from 2010, 2013, 2016, and 2019 individually for "Wages and Salaries".

In [60]:
# 2010 - 2012
df_WagesAndSalaries_2010 = df_WagesAndSalaries.loc[
    (df_WagesAndSalaries['REF_DATE'] == 2010) |
    (df_WagesAndSalaries['REF_DATE'] == 2011) |
    (df_WagesAndSalaries['REF_DATE'] == 2012)
]

grouped = df_WagesAndSalaries_2010.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                sum  size
REF_DATE                 
2010      2489681.0  1224
2011      2604921.0  1224
2012      2687510.0  1224


In [61]:
print("Grabbing the data from 2013, 2016, and 2019.")

# 2013 - 2015
df_WagesAndSalaries_2013 = df_WagesAndSalaries.loc[
    (df_WagesAndSalaries['REF_DATE'] == 2013) |
    (df_WagesAndSalaries['REF_DATE'] == 2014) |
    (df_WagesAndSalaries['REF_DATE'] == 2015)
]

# 2016 - 2018
df_WagesAndSalaries_2016 = df_WagesAndSalaries.loc[
    (df_WagesAndSalaries['REF_DATE'] == 2016) |
    (df_WagesAndSalaries['REF_DATE'] == 2017) |
    (df_WagesAndSalaries['REF_DATE'] == 2018)
]

# 2019 - 2021
df_WagesAndSalaries_2019 = df_WagesAndSalaries.loc[
    (df_WagesAndSalaries['REF_DATE'] == 2019) |
    (df_WagesAndSalaries['REF_DATE'] == 2020) |
    (df_WagesAndSalaries['REF_DATE'] == 2021)
]

grouped = df_WagesAndSalaries_2013.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_WagesAndSalaries_2016.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = df_WagesAndSalaries_2019.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

Grabbing the data from 2013, 2016, and 2019.
                sum  size
REF_DATE                 
2013      2746199.0  1224
2014      2844225.0  1224
2015      2965770.0  1224
                sum  size
REF_DATE                 
2016      3043217.0  1224
2017      3118809.0  1224
2018      3263527.0  1224
                sum  size
REF_DATE                 
2019      3437480.0  1224
2020      3532888.0  1224
2021      3764252.0  1224


Pandas profiling for years 2013, 2016, and 2019 for "Wages and Salaries".

In [62]:
# pp = ProfileReport(df_WagesAndSalaries_2013, title="Wages and Salaries 2013")
# pp_df = pp.to_html()

# f = open("Wages and Salaries 2013.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_WagesAndSalaries_2016, title="Wages and Salaries 2016")
# pp_df = pp.to_html()

# f = open("Wages and Salaries 2016.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(df_WagesAndSalaries_2019, title="Wages and Salaries 2019")
# pp_df = pp.to_html()

# f = open("Wages and Salaries 2019.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

For clarity, the results are stored in a new directory, organized by training/testing dataset.


In [63]:
CreatedTheFile = toOrganizedOutputFiles('Result_By_Testing_Training')

Directory 'Result_By_Testing_Training' is ALREADY created
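
`toOrganizedOutputFiles` is defined outside this section, so its exact behavior is not shown here. As a rough sketch of what such a helper might do, the hypothetical `ensure_output_dir` below creates a directory only when it is missing and reports which case occurred. It is an assumption for illustration, not the notebook's implementation.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create path if needed; return True when it was newly created."""
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    if existed:
        print(f"Directory '{path}' already exists")
    else:
        print(f"Directory '{path}' created")
    return not existed


# Use a temporary base directory so the example leaves the project untouched.
base = tempfile.mkdtemp()
target = os.path.join(base, "Result_By_Testing_Training")
first = ensure_output_dir(target)   # creates the directory
second = ensure_output_dir(target)  # finds it already present
```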


<h3> Part 6 - Division between training and testing dataset </h3>
Training is about 67% and testing is about 33% of the 2013-2021 data (six years versus three, with 1224 rows per year for each indicator).<br />
Training (2013-2018), Testing (2019-2021).

The training dataset covers 2013 through 2018.<br />
The testing dataset covers 2019 onward.<br />
The 2013-2018 years are still being analyzed, though.<br />
I decided not to use the train_test_split method because I have already divided the dataset by year.<br />
Instead, I divided it manually.
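
A year-based split like the one described above can be sketched as a small function. The name `split_by_year` and the toy frame are assumptions made for illustration. The only thing taken from the notebook is that `REF_DATE` holds integer years. The example also prints the resulting shares, which come out at two thirds and one third when every year has the same number of rows.

```python
import pandas as pd


def split_by_year(df, train_years, test_years, year_col="REF_DATE"):
    """Split df into training and testing frames by calendar year."""
    train = df.loc[df[year_col].isin(list(train_years))].copy()
    test = df.loc[df[year_col].isin(list(test_years))].copy()
    return train, test


# Toy frame: four rows per year, 2010 through 2021.
toy = pd.DataFrame({
    "REF_DATE": [y for y in range(2010, 2022) for _ in range(4)],
    "VALUE": list(range(48)),
})

train, test = split_by_year(toy, range(2013, 2019), range(2019, 2022))
train_share = len(train) / (len(train) + len(test))
print(f"train share: {train_share:.1%}, test share: {1 - train_share:.1%}")
```

Unlike a random `train_test_split`, this keeps every testing year later than every training year, which matters when the goal is to evaluate on the most recent period.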

In [64]:
# https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/

# This will be used if I were to use train_test_split given.

# from sklearn.model_selection import train_test_split

# train, test = train_test_split(dataset, ...)
# x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)    

In [65]:
# Average annual hours worked
# Use 2013-2015, 2016-2018 as training set
# Use 2019-2021 as testing set

frames = [df_AvgAnnHrsWrk_2013, df_AvgAnnHrsWrk_2016]
training_df_AvgAnnHrsWrk = pd.concat(frames)
testing_df_AvgAnnHrsWrk = df_AvgAnnHrsWrk_2019.copy()

grouped = training_df_AvgAnnHrsWrk.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_AvgAnnHrsWrk.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                sum  size
REF_DATE                 
2013      1906851.0  1224
2014      1898034.0  1224
2015      1907286.0  1224
2016      1899738.0  1224
2017      1881389.0  1224
2018      1894227.0  1224
                sum  size
REF_DATE                 
2019      1894126.0  1224
2020      1873544.0  1224
2021      1901826.0  1224


In [66]:
# pp = ProfileReport(training_df_AvgAnnHrsWrk, title="Average annual hours worked Training Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average annual hours worked Training Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(testing_df_AvgAnnHrsWrk, title="Average annual hours worked Testing Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average annual hours worked Testing Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

In [67]:
# Average annual wages and salaries

frames = [df_AvgAnnWages_2013, df_AvgAnnWages_2016]
training_df_AvgAnnWages = pd.concat(frames)
testing_df_AvgAnnWages = df_AvgAnnWages_2019.copy()

grouped = training_df_AvgAnnWages.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_AvgAnnWages.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                 sum  size
REF_DATE                  
2013      50598135.0  1224
2014      51805889.0  1224
2015      52715143.0  1224
2016      53166285.0  1224
2017      53965359.0  1224
2018      55525920.0  1224
                 sum  size
REF_DATE                  
2019      56997121.0  1224
2020      60597775.0  1224
2021      60687848.0  1224


In [68]:
# pp = ProfileReport(training_df_AvgAnnWages, title="Average annual wages and salaries Training Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average annual wages and salaries Training Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(testing_df_AvgAnnWages, title="Average annual wages and salaries Testing Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average annual wages and salaries Testing Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

In [69]:
# Average hourly wage

frames = [df_AvgHrsWages_2013, df_AvgHrsWages_2016]
training_df_AvgHrsWages = pd.concat(frames)
testing_df_AvgHrsWages = df_AvgHrsWages_2019.copy()

grouped = training_df_AvgHrsWages.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_AvgHrsWages.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

               sum  size
REF_DATE                
2013      31932.19  1224
2014      32916.44  1224
2015      33362.79  1224
2016      33756.84  1224
2017      34571.41  1224
2018      35356.16  1224
               sum  size
REF_DATE                
2019      36293.34  1224
2020      38980.38  1224
2021      38514.73  1224


In [70]:
# pp = ProfileReport(training_df_AvgHrsWages, title="Average hourly wage Training Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average hourly wage Training Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(testing_df_AvgHrsWages, title="Average hourly wage Testing Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average hourly wage Testing Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

In [71]:
# Average weekly hours worked

frames = [df_AvgWeekHrsWrked_2013, df_AvgWeekHrsWrked_2016]
training_df_AvgWeekHrsWrked = pd.concat(frames)
testing_df_AvgWeekHrsWrked = df_AvgWeekHrsWrked_2019.copy()

grouped = training_df_AvgWeekHrsWrked.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_AvgWeekHrsWrked.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

              sum  size
REF_DATE               
2013      36671.0  1224
2014      36483.0  1224
2015      36678.0  1224
2016      36541.0  1224
2017      36170.0  1224
2018      36416.0  1224
              sum  size
REF_DATE               
2019      36400.0  1224
2020      36036.0  1224
2021      36555.0  1224


In [72]:
# pp = ProfileReport(training_df_AvgWeekHrsWrked, title="Average weekly hours worked Training Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average weekly hours worked Training Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(testing_df_AvgWeekHrsWrked, title="Average weekly hours worked Testing Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Average weekly hours worked Testing Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

In [73]:
# Hours Worked

frames = [df_Hrs_Wrked_2013, df_Hrs_Wrked_2016]
training_df_Hrs_Wrked = pd.concat(frames)
testing_df_Hrs_Wrked = df_Hrs_Wrked_2019.copy()

grouped = training_df_Hrs_Wrked.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_Hrs_Wrked.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                  sum  size
REF_DATE                   
2013       98935086.0  1224
2014       99777902.0  1224
2015      101927894.0  1224
2016      103980992.0  1224
2017      103906357.0  1224
2018      106885614.0  1224
                  sum  size
REF_DATE                   
2019      108384508.0  1224
2020      103897606.0  1224
2021      112280627.0  1224


In [74]:
# pp = ProfileReport(training_df_Hrs_Wrked, title="Hours Worked Training Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Hours Worked Training Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(testing_df_Hrs_Wrked, title="Hours Worked Testing Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Hours Worked Testing Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

In [75]:
# Number of jobs

frames = [df_NumOfJob_2013, df_NumOfJob_2016]
training_df_NumOfJob = pd.concat(frames)
testing_df_NumOfJob = df_NumOfJob_2019.copy()

grouped = training_df_NumOfJob.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_NumOfJob.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                 sum  size
REF_DATE                  
2013      63161481.0  1224
2014      63980381.0  1224
2015      65239031.0  1224
2016      66578386.0  1224
2017      66979678.0  1224
2018      68215341.0  1224
                 sum  size
REF_DATE                  
2019      69702012.0  1224
2020      67249987.0  1224
2021      70971515.0  1224


In [76]:
# pp = ProfileReport(training_df_NumOfJob, title="Number of jobs Training Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Number of jobs Training Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# pp = ProfileReport(testing_df_NumOfJob, title="Number of jobs Testing Dataset")
# pp_df = pp.to_html()

# f = open("Result_By_Testing_Training/Number of jobs Testing Dataset.html", "a")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

In [77]:
# Wages and Salaries

frames = [df_WagesAndSalaries_2013, df_WagesAndSalaries_2016]
training_df_WagesAndSalaries = pd.concat(frames)
testing_df_WagesAndSalaries = df_WagesAndSalaries_2019.copy()

grouped = training_df_WagesAndSalaries.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

grouped = testing_df_WagesAndSalaries.groupby(['REF_DATE'])
print(grouped['VALUE'].agg([np.sum, np.size]))

                sum  size
REF_DATE                 
2013      2746199.0  1224
2014      2844225.0  1224
2015      2965770.0  1224
2016      3043217.0  1224
2017      3118809.0  1224
2018      3263527.0  1224
                sum  size
REF_DATE                 
2019      3437480.0  1224
2020      3532888.0  1224
2021      3764252.0  1224


In [78]:
# profile = ProfileReport(training_df_WagesAndSalaries, title="Wages and Salaries Training Dataset")
# pp_df = profile.to_html()

# f = open("Result_By_Testing_Training/Wages and Salaries Training Dataset.html", "w")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

# profile = ProfileReport(testing_df_WagesAndSalaries, title="Wages and Salaries Testing Dataset")
# pp_df = profile.to_html()

# f = open("Result_By_Testing_Training/Wages and Salaries Testing Dataset.html", "w")  # Export to an HTML file without modifying any columns in the dataset.
# f.write(pp_df)
# f.close()

Final output for "Average annual hours worked"

In [79]:
class Target_To_Analysis:

    def __init__(self, df, pd, np, pp, sns, year):
        # df is a list of DataFrames; year holds one label per DataFrame.
        self.dfa_Target_To_Analysis = df
        self.year = year
        self.pd = pd
        self.np = np
        self.pp = pp
        self.sns = sns

    # Print per-characteristic and overall summary statistics for each DataFrame.
    def print_result(self):
        for n, df_Target_To_Analysis in enumerate(self.dfa_Target_To_Analysis):
            values = df_Target_To_Analysis['VALUE']
            grouped = df_Target_To_Analysis.groupby(['Characteristics'])
            print(self.year[n])
            print(grouped['VALUE'].agg([np.sum, np.mean, np.min, np.median, np.max, np.size]))
            print("Overall,")
            print("Sum : ", np.sum(values))
            print("Mean : ", np.mean(values))
            print("Min/median/max :", np.min(values), "/",
                  np.median(values), "/",
                  np.max(values))
            print("Standard Deviation : ", np.std(values))
            print("Skewness : ", values.skew())
            print("Total size : ", len(df_Target_To_Analysis.index))
            print()

    # Plot a 100-bin histogram of VALUE for the n-th DataFrame (0 = first).
    def print_histogram(self, n):
        sns.displot(data=self.dfa_Target_To_Analysis[int(n)], x="VALUE", kind="hist", bins=100, aspect=1.5)
        plt.show()
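As a side note on the summary statistics printed by `print_result`, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not part of the dataset). It passes string names to `agg`, which recent pandas versions recommend over NumPy callables such as `np.sum`; with strings the columns are named `min` and `max` rather than `amin` and `amax`. It also shows that `np.std` on a Series corresponds to `ddof=0`, while `Series.std()` defaults to `ddof=1`.

```python
import pandas as pd

# Toy stand-in for one indicator table; the values are made up for illustration.
toy = pd.DataFrame({
    "Characteristics": ["Female employees", "Female employees",
                        "Male employees", "Male employees"],
    "VALUE": [10.0, 20.0, 30.0, 50.0],
})

# String names select the pandas implementations directly.
summary = toy.groupby("Characteristics")["VALUE"].agg(
    ["sum", "mean", "min", "median", "max", "size"]
)
print(summary)

# Population standard deviation (what np.std computes) versus the
# sample standard deviation (the pandas default).
population_std = toy["VALUE"].std(ddof=0)
sample_std = toy["VALUE"].std()
print("Population std:", population_std)
print("Sample std:", sample_std)
```

Which of the two standard deviations is appropriate depends on whether the rows are treated as the whole population or as a sample.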

In [80]:
dfa_Target_To_Analysis = [training_df_AvgAnnHrsWrk, testing_df_AvgAnnHrsWrk]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                        sum         mean    amin  median  \
Characteristics                                                            
15 to 24 years                     348240.0   906.875000   462.0   916.0   
25 to 34 years                     677949.0  1614.164286  1236.0  1604.5   
35 to 44 years                     750842.0  1787.719048  1436.0  1760.5   
45 to 54 years                     773872.0  1842.552381  1523.0  1829.0   
55 to 64 years                     703129.0  1674.116667  1390.0  1679.0   
65 years old and over              413498.0  1076.817708   701.0  1073.5   
College diploma                    699642.0  1665.814286  1408.0  1647.0   
Female employees                   643890.0  1533.071429  1231.0  1538.0   
High school diploma and less       554052.0  1319.171429  1065.0  1317.0   
Immigrant employees                648482.0  1637.580808  1333.0  1600.0   
Indigenous identity employees      582295.0  1470.441919  1057.0  1492.0   

In [81]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


Final output for "Average annual wages and salaries"

In [82]:
dfa_Target_To_Analysis = [training_df_AvgAnnWages, testing_df_AvgAnnWages]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                          sum          mean     amin   median  \
Characteristics                                                                 
15 to 24 years                      5844745.0  15220.690104   8769.0  14305.5   
25 to 34 years                     16279492.0  38760.695238  23433.0  37717.0   
35 to 44 years                     22063695.0  52532.607143  33680.0  51506.0   
45 to 54 years                     23873488.0  56841.638095  30761.0  55972.0   
55 to 64 years                     22698208.0  54043.352381  28716.0  51361.5   
65 years old and over              12199299.0  31769.007812  14168.0  29624.0   
College diploma                    18964213.0  45152.888095  26342.0  42893.0   
Female employees                   17243175.0  41055.178571  23619.0  39648.0   
High school diploma and less       12141489.0  28908.307143  13589.0  27359.0   
Immigrant employees                17912206.0  45232.843434  19120.0  43440.0   
Indigenous iden

In [83]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


Final output for "Average hourly wage"

In [84]:
dfa_Target_To_Analysis = [training_df_AvgHrsWages, testing_df_AvgHrsWages]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                        sum       mean   amin  median   amax  \
Characteristics                                                                
15 to 24 years                      6394.37  16.652005  11.70  15.830  33.72   
25 to 34 years                     10151.74  24.170810  15.07  23.265  47.75   
35 to 44 years                     12429.33  29.593643  17.22  29.065  54.92   
45 to 54 years                     13001.51  30.955976  16.89  30.625  56.39   
55 to 64 years                     13540.12  32.238381  16.73  30.490  63.53   
65 years old and over              11253.83  29.306849  15.90  27.780  69.41   
College diploma                    11423.95  27.199881  15.59  25.745  55.36   
Female employees                   11253.54  26.794143  15.83  25.925  50.72   
High school diploma and less        9141.89  21.766405  12.76  20.985  38.43   
Immigrant employees                10867.07  27.442096  13.35  25.855  57.56   
Indigenous identity employe

In [85]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


Final output for "Average weekly hours worked"

In [86]:
dfa_Target_To_Analysis = [training_df_AvgWeekHrsWrked, testing_df_AvgWeekHrsWrked]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                       sum       mean  amin  median  amax  \
Characteristics                                                             
15 to 24 years                      6701.0  17.450521   9.0    18.0  23.0   
25 to 34 years                     13045.0  31.059524  24.0    31.0  38.0   
35 to 44 years                     14441.0  34.383333  28.0    34.0  42.0   
45 to 54 years                     14869.0  35.402381  29.0    35.0  44.0   
55 to 64 years                     13528.0  32.209524  27.0    32.0  38.0   
65 years old and over               7948.0  20.697917  13.0    21.0  29.0   
College diploma                    13453.0  32.030952  27.0    32.0  39.0   
Female employees                   12380.0  29.476190  24.0    30.0  33.0   
High school diploma and less       10649.0  25.354762  20.0    25.0  32.0   
Immigrant employees                12463.0  31.472222  26.0    31.0  48.0   
Indigenous identity employees      11198.0  28.277778  20.0    

In [87]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


Final output for "Hours Worked"

In [88]:
dfa_Target_To_Analysis = [training_df_Hrs_Wrked, testing_df_Hrs_Wrked]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                          sum           mean   amin   median  \
Characteristics                                                                
15 to 24 years                      6319354.0   16456.651042    6.0   4000.0   
25 to 34 years                     21892318.0   52124.566667   32.0   9676.5   
35 to 44 years                     23526930.0   56016.500000   32.0  10443.5   
45 to 54 years                     26229128.0   62450.304762   29.0  10857.5   
55 to 64 years                     20321678.0   48384.947619   20.0   8825.5   
65 years old and over               4298529.0   11194.085938   26.0   2495.5   
College diploma                    27470020.0   65404.809524   30.0  10379.5   
Female employees                   69941334.0  166526.985714   57.0  27674.0   
High school diploma and less       18744590.0   44629.976190   54.0  11567.5   
Immigrant employees                25508532.0   64415.484848   41.0   6476.5   
Indigenous identity employe

In [89]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


Final output for "Number of jobs"

In [90]:
dfa_Target_To_Analysis = [training_df_NumOfJob, testing_df_NumOfJob]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                          sum           mean   amin   median  \
Characteristics                                                                
15 to 24 years                      7043629.0   18342.783854   13.0   4215.0   
25 to 34 years                     13985914.0   33299.795238   17.0   5885.0   
35 to 44 years                     13654580.0   32510.904762   18.0   5906.0   
45 to 54 years                     14554094.0   34652.604762   14.0   5993.0   
55 to 64 years                     12313508.0   29317.876190   11.0   5355.0   
65 years old and over               4150652.0   10808.989583   18.0   2306.5   
College diploma                    17030502.0   40548.814286   16.0   6147.5   
Female employees                   45510332.0  108357.933333   35.0  18345.5   
High school diploma and less       14427614.0   34351.461905   36.0   8041.5   
Immigrant employees                16175470.0   40847.146465   19.0   3842.0   
Indigenous identity employe

In [91]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


Final output for "Wages and Salaries"

In [92]:
dfa_Target_To_Analysis = [training_df_WagesAndSalaries, testing_df_WagesAndSalaries]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set','testing set'])
dfa_Target_To_Analysis.print_result()

training set
                                         sum         mean  amin  median  \
Characteristics                                                           
15 to 24 years                      108973.0   283.783854   0.0    63.0   
25 to 34 years                      538547.0  1282.254762   1.0   211.0   
35 to 44 years                      710804.0  1692.390476   1.0   252.5   
45 to 54 years                      842807.0  2006.683333   1.0   285.5   
55 to 64 years                      657465.0  1565.392857   1.0   228.0   
65 years old and over               138783.0   361.414062   1.0    66.5   
College diploma                     734205.0  1748.107143   1.0   232.5   
Female employees                   1953878.0  4652.090476   2.0   592.5   
High school diploma and less        399426.0   951.014286   2.0   232.0   
Immigrant employees                 749325.0  1892.234848   1.0   172.5   
Indigenous identity employees       104776.0   264.585859   1.0    77.0   
Male employe

In [93]:
print("Histogram for training dataset")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset")
dfa_Target_To_Analysis.print_histogram(1)

Histogram for training dataset
Histogram for testing dataset


<h3> Chi-squared results </h3>

All of the pandas DataFrames are analysed here.<br />
Each one is analysed with a chi-square test of independence.<br />
The variables tested at this point are all categorical.<br />
"REF_DATE", the column used to split the data, is cross-tabulated against "GEO" and against "Characteristics".
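To make the test concrete, here is a minimal sketch on a made-up frame (the `toy` data is an assumption for illustration only). It builds a contingency table of row counts with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`, the same two calls the class in the next cell relies on.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: one row per record, with the same category mix in every year.
toy = pd.DataFrame({
    "REF_DATE": [2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
    "Characteristics": ["Female employees", "Male employees", "Male employees"] * 3,
})

# Rows are categories, columns are years, cells are row counts.
table = pd.crosstab(toy["Characteristics"], toy["REF_DATE"])
print(table)

# Chi-square test of independence on the table of counts.
stat, p, dof, expected = chi2_contingency(table)
print("statistic:", stat)
print("p value:", p)
print("degrees of freedom:", dof)
```

In this toy table the counts repeat identically across years, so the observed and expected counts coincide and the test cannot reject independence. Note that the test looks only at how many rows fall into each cell, not at the `VALUE` column.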

In [94]:
# https://www.geeksforgeeks.org/python-pearsons-chi-square-test/
# https://www.geeksforgeeks.org/contingency-table-in-python/
# https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
# https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/


# from scipy.stats import chi2_contingency

class ChiSquareAnalysisByYear:

    def __init__(self, df, classify, pd, np):

        self.data_crosstab = pd.crosstab(df[classify], 
                            df['REF_DATE'], 
                               margins = False) 
        # print(self.data_crosstab) 

    def displayCrosstab(self):
        # Display the whole crosstab table
        print(self.data_crosstab)

    def returnCrosstab(self):
        # Return the crosstab table itself
        return self.data_crosstab

    def doChiSquareAnalysis(self):

        # Contingency table of row counts (classify x REF_DATE)
        data = self.data_crosstab
        stat, p, dof, expected = chi2_contingency(data)

        # Interpret the p-value at the 5% significance level
        alpha = 0.05
        print("p value is " + str(p))
        if p <= alpha:
            print('Dependent (reject H0)')
        else:
            print('Independent (fail to reject H0)')

In [95]:
# Chi-square analysis.
# Applied to the training/testing split.
# The training set is analysed as well, but only the testing set will be used later.
# The per-year analysis has been commented out.

df_analysis = [training_df_AvgAnnHrsWrk, testing_df_AvgAnnHrsWrk]

# First one is Training set. Second one is Testing set.
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

In [96]:
# Chi-square analysis for "Average annual wages and salaries"

df_analysis = [training_df_AvgAnnWages, testing_df_AvgAnnWages]

# First one is Training set. Second one is Testing set.
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

In [97]:
df_analysis = [training_df_AvgHrsWages, testing_df_AvgHrsWages]

# First one is Training set. Second one is Testing set.
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

In [98]:
df_analysis = [training_df_AvgWeekHrsWrked, testing_df_AvgWeekHrsWrked]
 
# First one is Training set. Second one is Testing set.
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

In [99]:
df_analysis = [training_df_Hrs_Wrked, testing_df_Hrs_Wrked]

# First one is Training set. Second one is Testing set.    
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

In [100]:
df_analysis = [training_df_NumOfJob, testing_df_NumOfJob]

# First one is Training set. Second one is Testing set. 
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

In [101]:
df_analysis = [training_df_WagesAndSalaries, testing_df_WagesAndSalaries]

# First one is Training set. Second one is Testing set.    
for x in df_analysis:
    data_crosstab_char = ChiSquareAnalysisByYear(x,'Characteristics', pd, np)
    data_crosstab_char.displayCrosstab()
    data_crosstab_char.doChiSquareAnalysis()

    data_crosstab_province = ChiSquareAnalysisByYear(x,'GEO', pd, np)
    data_crosstab_province.displayCrosstab()
    data_crosstab_province.doChiSquareAnalysis()

REF_DATE                           2013  2014  2015  2016  2017  2018
Characteristics                                                      
15 to 24 years                       64    64    64    64    64    64
25 to 34 years                       70    70    70    70    70    70
35 to 44 years                       70    70    70    70    70    70
45 to 54 years                       70    70    70    70    70    70
55 to 64 years                       70    70    70    70    70    70
65 years old and over                64    64    64    64    64    64
College diploma                      70    70    70    70    70    70
Female employees                     70    70    70    70    70    70
High school diploma and less         70    70    70    70    70    70
Immigrant employees                  66    66    66    66    66    66
Indigenous identity employees        66    66    66    66    66    66
Male employees                       70    70    70    70    70    70
Non-immigrant employ

Export the per-period datasets that make up the training and testing sets as CSV files.
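The export can also be written as a loop over a dictionary of frames. The following is a minimal sketch under stated assumptions: the `frames` dictionary holds made-up stand-ins for the per-period DataFrames, and a temporary directory replaces `Result_By_Testing_Training` so the example does not touch real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the per-period DataFrames built earlier.
frames = {
    "df_NumOfJob_2013": pd.DataFrame({"REF_DATE": [2013, 2014, 2015], "VALUE": [1.0, 2.0, 3.0]}),
    "df_NumOfJob_2016": pd.DataFrame({"REF_DATE": [2016, 2017, 2018], "VALUE": [4.0, 5.0, 6.0]}),
    "df_NumOfJob_2019": pd.DataFrame({"REF_DATE": [2019, 2020, 2021], "VALUE": [7.0, 8.0, 9.0]}),
}

# Temporary output directory; created if it does not exist yet.
out_dir = Path(tempfile.mkdtemp()) / "Result_By_Testing_Training"
out_dir.mkdir(parents=True, exist_ok=True)

# Write one CSV per frame, named after its dictionary key.
written = []
for name, frame in frames.items():
    path = out_dir / f"{name}.csv"
    frame.to_csv(path, index=False)
    written.append(path)

# Read the first file back to confirm the export round-trips.
round_trip = pd.read_csv(written[0])
print(round_trip)
```

Keeping the frames in a dictionary means the file name and the data stay paired, which avoids copy-and-paste mismatches between variable names and paths.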

In [102]:
# Save each DataFrame to a CSV file

# df_AvgAnnHrsWrk # Average annual hours worked
df_AvgAnnHrsWrk_2013.to_csv('Result_By_Testing_Training/df_AvgAnnHrsWrk_2013.csv', index=False) # 2013-2015
df_AvgAnnHrsWrk_2016.to_csv('Result_By_Testing_Training/df_AvgAnnHrsWrk_2016.csv', index=False) # 2016-2018
df_AvgAnnHrsWrk_2019.to_csv('Result_By_Testing_Training/df_AvgAnnHrsWrk_2019.csv', index=False) # 2019-2021


# df_AvgAnnWages # Average annual wages and salaries
df_AvgAnnWages_2013.to_csv('Result_By_Testing_Training/df_AvgAnnWages_2013.csv', index=False) # 2013-2015
df_AvgAnnWages_2016.to_csv('Result_By_Testing_Training/df_AvgAnnWages_2016.csv', index=False) # 2016-2018
df_AvgAnnWages_2019.to_csv('Result_By_Testing_Training/df_AvgAnnWages_2019.csv', index=False) # 2019-2021

# df_AvgHrsWages # Average hourly wage
df_AvgHrsWages_2013.to_csv('Result_By_Testing_Training/df_AvgHrsWages_2013.csv', index=False) # 2013-2015
df_AvgHrsWages_2016.to_csv('Result_By_Testing_Training/df_AvgHrsWages_2016.csv', index=False) # 2016-2018
df_AvgHrsWages_2019.to_csv('Result_By_Testing_Training/df_AvgHrsWages_2019.csv', index=False) # 2019-2021

# df_AvgWeekHrsWrked # Average weekly hours worked
df_AvgWeekHrsWrked_2013.to_csv('Result_By_Testing_Training/df_AvgWeekHrsWrked_2013.csv', index=False) # 2013-2015
df_AvgWeekHrsWrked_2016.to_csv('Result_By_Testing_Training/df_AvgWeekHrsWrked_2016.csv', index=False) # 2016-2018
df_AvgWeekHrsWrked_2019.to_csv('Result_By_Testing_Training/df_AvgWeekHrsWrked_2019.csv', index=False) # 2019-2021

# df_Hrs_Wrked # Hours Worked
df_Hrs_Wrked_2013.to_csv('Result_By_Testing_Training/df_Hrs_Wrked_2013.csv', index=False) # 2013-2015
df_Hrs_Wrked_2016.to_csv('Result_By_Testing_Training/df_Hrs_Wrked_2016.csv', index=False) # 2016-2018
df_Hrs_Wrked_2019.to_csv('Result_By_Testing_Training/df_Hrs_Wrked_2019.csv', index=False) # 2019-2021

# df_NumOfJob # Number of jobs
df_NumOfJob_2013.to_csv('Result_By_Testing_Training/df_NumOfJob_2013.csv', index=False) # 2013-2015
df_NumOfJob_2016.to_csv('Result_By_Testing_Training/df_NumOfJob_2016.csv', index=False) # 2016-2018
df_NumOfJob_2019.to_csv('Result_By_Testing_Training/df_NumOfJob_2019.csv', index=False) # 2019-2021

# df_WagesAndSalaries # Wages and Salaries
df_WagesAndSalaries_2013.to_csv('Result_By_Testing_Training/df_WagesAndSalaries_2013.csv', index=False) # 2013-2015
df_WagesAndSalaries_2016.to_csv('Result_By_Testing_Training/df_WagesAndSalaries_2016.csv', index=False) # 2016-2018
df_WagesAndSalaries_2019.to_csv('Result_By_Testing_Training/df_WagesAndSalaries_2019.csv', index=False) # 2019-2021



For clarity, the results split by characteristic are stored in a new directory.


In [103]:
CreatedTheFile = toOrganizedOutputFiles('Result_By_Characteristics')

Directory 'Result_By_Characteristics' is ALREADY created


In [104]:
# https://www.analyticsvidhya.com/blog/2022/01/different-types-of-regression-models/

# https://online.hbs.edu/blog/post/types-of-data-analysis
# https://online.hbs.edu/blog/post/descriptive-analytics
# https://online.hbs.edu/blog/post/diagnostic-analytics
# https://online.hbs.edu/blog/post/predictive-analytics
# https://chartio.com/learn/data-analytics/types-of-data-analysis/
# https://www.simplilearn.com/data-analysis-methods-process-types-article#types_of_data_analysis

# https://builtin.com/data-science/types-of-data-analysis
# https://careerfoundry.com/en/blog/data-analytics/different-types-of-data-analysis/


<h3> Part 7 - Division by Characteristics </h3>
Division by 'age', 'gender', 'education' and 'immigrant'

In the next step, I filter the data into the following groups: "age group", "gender level", "education level", "immigrant level" and "Aboriginal status" (commented out).
* As mentioned before, I decided not to use 2010-2012.
* I analysed both the training and testing sets. (The first is training, 2013-2018; the second is testing, 2019-2021.)
* The original periods were (2010-2012, dropped), (2013-2015), (2016-2018), (2019-2021).
* The other characteristics in the dataset are dropped as well.
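The group selection described above can be sketched compactly with `Series.isin`. This is a minimal illustration, not the notebook's own method: the `GROUPS` mapping, the `split_by_group` helper, and the `toy` frame are hypothetical names introduced here, while the labels are the ones used in the filtering cell.

```python
import pandas as pd

# Characteristic labels per group, as used in the filtering cell.
GROUPS = {
    "Age": ["15 to 24 years", "25 to 34 years", "35 to 44 years",
            "45 to 54 years", "55 to 64 years", "65 years old and over"],
    "Gender": ["Female employees", "Male employees"],
    "Education": ["High school diploma and less", "Trade certificate",
                  "University degree and higher"],
    "Immigrant": ["Immigrant employees", "Non-immigrant employees"],
}


def split_by_group(df, groups=GROUPS):
    """Return one filtered DataFrame per group name."""
    return {name: df[df["Characteristics"].isin(labels)]
            for name, labels in groups.items()}


# Made-up rows; the last label belongs to none of the four groups.
toy = pd.DataFrame({
    "Characteristics": ["15 to 24 years", "Female employees", "Male employees",
                        "Trade certificate", "Immigrant employees",
                        "Indigenous identity employees"],
    "VALUE": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

subsets = split_by_group(toy)
for name, subset in subsets.items():
    print(name, len(subset))
```

Rows whose label is not listed in any group, such as "Indigenous identity employees" here, are dropped from every subset.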

"Average annual hours worked" is filtered by the following: "Age group", "Gender level", "Education level", and "Immigration status".<br />
"Aboriginal status" has been commented out.

In [105]:
# Dataset by training set inside Average Annual Hours Worked

print("\nAge group")
training_df_AvgAnnHrsWrk_ByAge = training_df_AvgAnnHrsWrk.loc[
    (training_df_AvgAnnHrsWrk['Characteristics'] == '15 to 24 years') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == '25 to 34 years') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == '35 to 44 years') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == '45 to 54 years') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == '55 to 64 years') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == '65 years old and over')]
# print(training_df_AvgAnnHrsWrk_ByAge.head(20))
grouped = training_df_AvgAnnHrsWrk_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(training_df_AvgAnnHrsWrk_ByAge.index))

print("\nGender group")
training_df_AvgAnnHrsWrk_ByGender = training_df_AvgAnnHrsWrk.loc[
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'Female employees') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'Male employees')
]
# print(training_df_AvgAnnHrsWrk_ByGender.head(20))
grouped = training_df_AvgAnnHrsWrk_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(training_df_AvgAnnHrsWrk_ByGender.index))

print("\nEducation group")
training_df_AvgAnnHrsWrk_ByEducation = training_df_AvgAnnHrsWrk.loc[
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'High school diploma and less') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'Trade certificate') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'University degree and higher')
]
# print(training_df_AvgAnnHrsWrk_ByEducation.head(20))
grouped = training_df_AvgAnnHrsWrk_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(training_df_AvgAnnHrsWrk_ByEducation.index))

print("\nImmigrant group")
training_df_AvgAnnHrsWrk_ByImmigrant = training_df_AvgAnnHrsWrk.loc[
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'Immigrant employees') |
    (training_df_AvgAnnHrsWrk['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_AvgAnnHrsWrk_ByImmigrant.head(20))
grouped = training_df_AvgAnnHrsWrk_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(training_df_AvgAnnHrsWrk_ByImmigrant.index))

# print("\nIndigenous group")
# training_df_AvgAnnHrsWrk_ByIndigenous = training_df_AvgAnnHrsWrk.loc[
#     (training_df_AvgAnnHrsWrk['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_AvgAnnHrsWrk['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_AvgAnnHrsWrk_ByIndigenous.head(20))
# grouped = training_df_AvgAnnHrsWrk_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("Total size : ",len(training_df_AvgAnnHrsWrk_ByIndigenous.index))


Age group
                            sum  size
Characteristics                      
15 to 24 years         348240.0   384
25 to 34 years         677949.0   420
35 to 44 years         750842.0   420
45 to 54 years         773872.0   420
55 to 64 years         703129.0   420
65 years old and over  413498.0   384
Total size :  2448

Gender group
                       sum  size
Characteristics                 
Female employees  643890.0   420
Male employees    698651.0   420
Total size :  840

Education group
                                   sum  size
Characteristics                             
High school diploma and less  554052.0   420
Trade certificate             621652.0   396
University degree and higher  690886.0   396
Total size :  1212

Immigrant group
                              sum  size
Characteristics                        
Immigrant employees      648482.0   396
Non-immigrant employees  622292.0   396
Total size :  792
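
The four filters above repeat the same pattern for every indicator. As a minimal sketch (not the code this notebook actually runs), the same split could be written once with `Series.isin` and a dictionary of labels. The names `CHARACTERISTIC_GROUPS`, `split_by_characteristics`, and `demo_df` are hypothetical and only serve the illustration; the labels are copied from the filters above, and the demo values are made up.

```python
import pandas as pd

# Characteristic labels copied from the filters used in this notebook.
CHARACTERISTIC_GROUPS = {
    "Age group": [
        "15 to 24 years", "25 to 34 years", "35 to 44 years",
        "45 to 54 years", "55 to 64 years", "65 years old and over",
    ],
    "Gender group": ["Female employees", "Male employees"],
    "Education group": [
        "High school diploma and less", "Trade certificate",
        "University degree and higher",
    ],
    "Immigrant group": ["Immigrant employees", "Non-immigrant employees"],
}


def split_by_characteristics(indicator_df, groups=CHARACTERISTIC_GROUPS):
    """Hypothetical helper: return {group name: filtered DataFrame}."""
    splits = {}
    for group_name, labels in groups.items():
        # isin() replaces the long chain of "== label | == label" tests.
        subset = indicator_df.loc[indicator_df["Characteristics"].isin(labels)]
        print(f"\n{group_name}")
        # String names give the same "sum" and "size" columns as np.sum / np.size.
        print(subset.groupby("Characteristics")["VALUE"].agg(["sum", "size"]))
        print("Total size : ", len(subset.index))
        splits[group_name] = subset
    return splits


# Made-up rows, only to show the call; the real input would be
# a frame such as training_df_AvgAnnHrsWrk.
demo_df = pd.DataFrame({
    "Characteristics": [
        "15 to 24 years", "15 to 24 years", "Female employees",
        "Male employees", "Trade certificate", "Immigrant employees",
        "Indigenous identity employees",
    ],
    "VALUE": [900.0, 950.0, 1500.0, 1650.0, 1600.0, 1550.0, 1400.0],
})
demo_splits = split_by_characteristics(demo_df)
```

With a helper like this, each indicator needs a single call, for example `split_by_characteristics(training_df_AvgAnnHrsWrk)`, and labels that are not listed (such as the Indigenous identity rows) are dropped automatically.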


In [106]:
# Dataset by testing set inside Average Annual Hours Worked

print("\nAge group")
testing_df_AvgAnnHrsWrk_ByAge = testing_df_AvgAnnHrsWrk.loc[
    (testing_df_AvgAnnHrsWrk['Characteristics'] == '15 to 24 years') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == '25 to 34 years') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == '35 to 44 years') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == '45 to 54 years') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == '55 to 64 years') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == '65 years old and over')]
# print(testing_df_AvgAnnHrsWrk_ByAge.head(20))
grouped = testing_df_AvgAnnHrsWrk_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(testing_df_AvgAnnHrsWrk_ByAge.index))

print("\nGender group")
testing_df_AvgAnnHrsWrk_ByGender = testing_df_AvgAnnHrsWrk.loc[
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Female employees') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Male employees')
]
# print(testing_df_AvgAnnHrsWrk_ByGender.head(20))
grouped = testing_df_AvgAnnHrsWrk_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(testing_df_AvgAnnHrsWrk_ByGender.index))

print("\nEducation group")
testing_df_AvgAnnHrsWrk_ByEducation = testing_df_AvgAnnHrsWrk.loc[
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'High school diploma and less') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Trade certificate') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'University degree and higher')
]
# print(testing_df_AvgAnnHrsWrk_ByEducation.head(20))
grouped = testing_df_AvgAnnHrsWrk_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(testing_df_AvgAnnHrsWrk_ByEducation.index))

print("\nImmigrant group")
testing_df_AvgAnnHrsWrk_ByImmigrant = testing_df_AvgAnnHrsWrk.loc[
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Immigrant employees') |
    (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_AvgAnnHrsWrk_ByImmigrant.head(20))
grouped = testing_df_AvgAnnHrsWrk_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("Total size : ",len(testing_df_AvgAnnHrsWrk_ByImmigrant.index))

# print("\nIndigenous group")
# testing_df_AvgAnnHrsWrk_ByIndigenous = testing_df_AvgAnnHrsWrk.loc[
#     (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_AvgAnnHrsWrk['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_AvgAnnHrsWrk_ByIndigenous.head(20))
# grouped = testing_df_AvgAnnHrsWrk_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("Total size : ",len(testing_df_AvgAnnHrsWrk_ByIndigenous.index))


Age group
                            sum  size
Characteristics                      
15 to 24 years         179872.0   192
25 to 34 years         333662.0   210
35 to 44 years         371386.0   210
45 to 54 years         384987.0   210
55 to 64 years         351843.0   210
65 years old and over  206760.0   192
Total size :  1224

Gender group
                       sum  size
Characteristics                 
Female employees  323755.0   210
Male employees    343101.0   210
Total size :  420

Education group
                                   sum  size
Characteristics                             
High school diploma and less  275636.0   210
Trade certificate             306174.0   198
University degree and higher  341795.0   198
Total size :  606

Immigrant group
                              sum  size
Characteristics                        
Immigrant employees      318818.0   198
Non-immigrant employees  310245.0   198
Total size :  396
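
Because the indicator is itself an average, the `sum` column is hard to read on its own; dividing it by `size` gives the mean value per row, which is easier to compare between groups. The sketch below does this for the training-set age groups, using the numbers copied from the output of cell 105. It assumes no `VALUE` is missing (the NA rows were removed in an earlier step), since `sum` skips missing values while `size` counts every row. The name `age_training` is only used for this illustration.

```python
import pandas as pd

# Sums and row counts copied from the training-set output
# for "Average annual hours worked" by age group.
age_training = pd.DataFrame(
    {
        "sum": [348240.0, 677949.0, 750842.0, 773872.0, 703129.0, 413498.0],
        "size": [384, 420, 420, 420, 420, 384],
    },
    index=[
        "15 to 24 years", "25 to 34 years", "35 to 44 years",
        "45 to 54 years", "55 to 64 years", "65 years old and over",
    ],
)

# Mean annual hours per row in each age group.
age_training["mean"] = age_training["sum"] / age_training["size"]
print(age_training.round(2))

# The group sizes add up to the "Total size" printed by the cell.
print("Total size : ", age_training["size"].sum())
```

On the real data, the same column can be obtained directly by adding `'mean'` to the list passed to `agg`.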


Filter "Average annual wages and salaries" by the following characteristics: "Age group", "Gender", "Education level", and "Immigration status".<br />
The "Aboriginal status" (Indigenous identity) filter has been commented out.

In [107]:
# Dataset by training dataset inside Average annual wages and salaries

print("\nAge group")
training_df_AvgAnnWages_ByAge = training_df_AvgAnnWages.loc[
    (training_df_AvgAnnWages['Characteristics'] == '15 to 24 years') |
    (training_df_AvgAnnWages['Characteristics'] == '25 to 34 years') |
    (training_df_AvgAnnWages['Characteristics'] == '35 to 44 years') |
    (training_df_AvgAnnWages['Characteristics'] == '45 to 54 years') |
    (training_df_AvgAnnWages['Characteristics'] == '55 to 64 years') |
    (training_df_AvgAnnWages['Characteristics'] == '65 years old and over')]
# print(training_df_AvgAnnWages_ByAge.head(20))
grouped = training_df_AvgAnnWages_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgAnnWages_ByAge.index))

print("\nGender group")
training_df_AvgAnnWages_ByGender = training_df_AvgAnnWages.loc[
    (training_df_AvgAnnWages['Characteristics'] == 'Female employees') |
    (training_df_AvgAnnWages['Characteristics'] == 'Male employees')
]
# print(training_df_AvgAnnWages_ByGender.head(20))
grouped = training_df_AvgAnnWages_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgAnnWages_ByGender.index))

print("\nEducation group")
training_df_AvgAnnWages_ByEducation = training_df_AvgAnnWages.loc[
    (training_df_AvgAnnWages['Characteristics'] == 'High school diploma and less') |
    (training_df_AvgAnnWages['Characteristics'] == 'Trade certificate') |
    (training_df_AvgAnnWages['Characteristics'] == 'University degree and higher')
]
# print(training_df_AvgAnnWages_ByEducation.head(20))
grouped = training_df_AvgAnnWages_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgAnnWages_ByEducation.index))

print("\nImmigrant group")
training_df_AvgAnnWages_ByImmigrant = training_df_AvgAnnWages.loc[
    (training_df_AvgAnnWages['Characteristics'] == 'Immigrant employees') |
    (training_df_AvgAnnWages['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_AvgAnnWages_ByImmigrant.head(20))
grouped = training_df_AvgAnnWages_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgAnnWages_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# training_df_AvgAnnWages_ByIndigenous = training_df_AvgAnnWages.loc[
#     (training_df_AvgAnnWages['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_AvgAnnWages['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_AvgAnnWages_ByIndigenous.head(20))
# grouped = training_df_AvgAnnWages_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(training_df_AvgAnnWages_ByIndigenous.index))


Age group
                              sum  size
Characteristics                        
15 to 24 years          5844745.0   384
25 to 34 years         16279492.0   420
35 to 44 years         22063695.0   420
45 to 54 years         23873488.0   420
55 to 64 years         22698208.0   420
65 years old and over  12199299.0   384
The total number of this one is  2448

Gender group
                         sum  size
Characteristics                   
Female employees  17243175.0   420
Male employees    21447737.0   420
The total number of this one is  840

Education group
                                     sum  size
Characteristics                               
High school diploma and less  12141489.0   420
Trade certificate             15485493.0   396
University degree and higher  22886417.0   396
The total number of this one is  1212

Immigrant group
                                sum  size
Characteristics                          
Immigrant employees      17912206.0   396
Non-imm

In [108]:
# Dataset by testing dataset inside Average annual wages and salaries

print("\nAge group")
testing_df_AvgAnnWages_ByAge = testing_df_AvgAnnWages.loc[
    (testing_df_AvgAnnWages['Characteristics'] == '15 to 24 years') |
    (testing_df_AvgAnnWages['Characteristics'] == '25 to 34 years') |
    (testing_df_AvgAnnWages['Characteristics'] == '35 to 44 years') |
    (testing_df_AvgAnnWages['Characteristics'] == '45 to 54 years') |
    (testing_df_AvgAnnWages['Characteristics'] == '55 to 64 years') |
    (testing_df_AvgAnnWages['Characteristics'] == '65 years old and over')]
# print(testing_df_AvgAnnWages_ByAge.head(20))
grouped = testing_df_AvgAnnWages_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgAnnWages_ByAge.index))

print("\nGender group")
testing_df_AvgAnnWages_ByGender = testing_df_AvgAnnWages.loc[
    (testing_df_AvgAnnWages['Characteristics'] == 'Female employees') |
    (testing_df_AvgAnnWages['Characteristics'] == 'Male employees')
]
# print(testing_df_AvgAnnWages_ByGender.head(20))
grouped = testing_df_AvgAnnWages_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgAnnWages_ByGender.index))

print("\nEducation group")
testing_df_AvgAnnWages_ByEducation = testing_df_AvgAnnWages.loc[
    (testing_df_AvgAnnWages['Characteristics'] == 'High school diploma and less') |
    (testing_df_AvgAnnWages['Characteristics'] == 'Trade certificate') |
    (testing_df_AvgAnnWages['Characteristics'] == 'University degree and higher')
]
# print(testing_df_AvgAnnWages_ByEducation.head(20))
grouped = testing_df_AvgAnnWages_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgAnnWages_ByEducation.index))

print("\nImmigrant group")
testing_df_AvgAnnWages_ByImmigrant = testing_df_AvgAnnWages.loc[
    (testing_df_AvgAnnWages['Characteristics'] == 'Immigrant employees') |
    (testing_df_AvgAnnWages['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_AvgAnnWages_ByImmigrant.head(20))
grouped = testing_df_AvgAnnWages_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgAnnWages_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# testing_df_AvgAnnWages_ByIndigenous = testing_df_AvgAnnWages.loc[
#     (testing_df_AvgAnnWages['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_AvgAnnWages['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_AvgAnnWages_ByIndigenous.head(20))
# grouped = testing_df_AvgAnnWages_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(testing_df_AvgAnnWages_ByIndigenous.index))


Age group
                              sum  size
Characteristics                        
15 to 24 years          3482308.0   192
25 to 34 years          9108041.0   210
35 to 44 years         12296655.0   210
45 to 54 years         13595377.0   210
55 to 64 years         12592232.0   210
65 years old and over   6840988.0   192
The total number of this one is  1224

Gender group
                         sum  size
Characteristics                   
Female employees   9929829.0   210
Male employees    11755264.0   210
The total number of this one is  420

Education group
                                     sum  size
Characteristics                               
High school diploma and less   6915927.0   210
Trade certificate              8609637.0   198
University degree and higher  12581443.0   198
The total number of this one is  606

Immigrant group
                               sum  size
Characteristics                         
Immigrant employees      9838690.0   198
Non-immigra

Filter "Average hourly wage" by the following characteristics: "Age group", "Gender", "Education level", and "Immigration status".<br />
The "Aboriginal status" (Indigenous identity) filter has been commented out.

In [109]:
# Dataset by training dataset inside "Average hourly wage"

print("\nAge group")
training_df_AvgHrsWages_ByAge = training_df_AvgHrsWages.loc[
    (training_df_AvgHrsWages['Characteristics'] == '15 to 24 years') |
    (training_df_AvgHrsWages['Characteristics'] == '25 to 34 years') |
    (training_df_AvgHrsWages['Characteristics'] == '35 to 44 years') |
    (training_df_AvgHrsWages['Characteristics'] == '45 to 54 years') |
    (training_df_AvgHrsWages['Characteristics'] == '55 to 64 years') |
    (training_df_AvgHrsWages['Characteristics'] == '65 years old and over')]
# print(training_df_AvgHrsWages_ByAge.head(20))
grouped = training_df_AvgHrsWages_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgHrsWages_ByAge.index))

print("\nGender group")
training_df_AvgHrsWages_ByGender = training_df_AvgHrsWages.loc[
    (training_df_AvgHrsWages['Characteristics'] == 'Female employees') |
    (training_df_AvgHrsWages['Characteristics'] == 'Male employees')
]
# print(training_df_AvgHrsWages_ByGender.head(20))
grouped = training_df_AvgHrsWages_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgHrsWages_ByGender.index))

print("\nEducation group")
training_df_AvgHrsWages_ByEducation = training_df_AvgHrsWages.loc[
    (training_df_AvgHrsWages['Characteristics'] == 'High school diploma and less') |
    (training_df_AvgHrsWages['Characteristics'] == 'Trade certificate') |
    (training_df_AvgHrsWages['Characteristics'] == 'University degree and higher')
]
# print(training_df_AvgHrsWages_ByEducation.head(20))
grouped = training_df_AvgHrsWages_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgHrsWages_ByEducation.index))

print("\nImmigrant group")
training_df_AvgHrsWages_ByImmigrant = training_df_AvgHrsWages.loc[
    (training_df_AvgHrsWages['Characteristics'] == 'Immigrant employees') |
    (training_df_AvgHrsWages['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_AvgHrsWages_ByImmigrant.head(20))
grouped = training_df_AvgHrsWages_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgHrsWages_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# training_df_AvgHrsWages_ByIndigenous = training_df_AvgHrsWages.loc[
#     (training_df_AvgHrsWages['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_AvgHrsWages['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_AvgHrsWages_ByIndigenous.head(20))
# grouped = training_df_AvgHrsWages_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(training_df_AvgHrsWages_ByIndigenous.index))


Age group
                            sum  size
Characteristics                      
15 to 24 years          6394.37   384
25 to 34 years         10151.74   420
35 to 44 years         12429.33   420
45 to 54 years         13001.51   420
55 to 64 years         13540.12   420
65 years old and over  11253.83   384
The total number of this one is  2448

Gender group
                       sum  size
Characteristics                 
Female employees  11253.54   420
Male employees    12929.73   420
The total number of this one is  840

Education group
                                   sum  size
Characteristics                             
High school diploma and less   9141.89   420
Trade certificate              9828.67   396
University degree and higher  13154.99   396
The total number of this one is  1212

Immigrant group
                              sum  size
Characteristics                        
Immigrant employees      10867.07   396
Non-immigrant employees  11128.84   396
The total number of this one is  792

In [110]:
# Dataset by testing dataset inside "Average hourly wage"

print("\nAge group")
testing_df_AvgHrsWages_ByAge = testing_df_AvgHrsWages.loc[
    (testing_df_AvgHrsWages['Characteristics'] == '15 to 24 years') |
    (testing_df_AvgHrsWages['Characteristics'] == '25 to 34 years') |
    (testing_df_AvgHrsWages['Characteristics'] == '35 to 44 years') |
    (testing_df_AvgHrsWages['Characteristics'] == '45 to 54 years') |
    (testing_df_AvgHrsWages['Characteristics'] == '55 to 64 years') |
    (testing_df_AvgHrsWages['Characteristics'] == '65 years old and over')]
# print(testing_df_AvgHrsWages_ByAge.head(20))
grouped = testing_df_AvgHrsWages_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgHrsWages_ByAge.index))

print("\nGender group")
testing_df_AvgHrsWages_ByGender = testing_df_AvgHrsWages.loc[
    (testing_df_AvgHrsWages['Characteristics'] == 'Female employees') |
    (testing_df_AvgHrsWages['Characteristics'] == 'Male employees')
]
# print(testing_df_AvgHrsWages_ByGender.head(20))
grouped = testing_df_AvgHrsWages_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgHrsWages_ByGender.index))

print("\nEducation group")
testing_df_AvgHrsWages_ByEducation = testing_df_AvgHrsWages.loc[
    (testing_df_AvgHrsWages['Characteristics'] == 'High school diploma and less') |
    (testing_df_AvgHrsWages['Characteristics'] == 'Trade certificate') |
    (testing_df_AvgHrsWages['Characteristics'] == 'University degree and higher')
]
# print(testing_df_AvgHrsWages_ByEducation.head(20))
grouped = testing_df_AvgHrsWages_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgHrsWages_ByEducation.index))

print("\nImmigrant group")
testing_df_AvgHrsWages_ByImmigrant = testing_df_AvgHrsWages.loc[
    (testing_df_AvgHrsWages['Characteristics'] == 'Immigrant employees') |
    (testing_df_AvgHrsWages['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_AvgHrsWages_ByImmigrant.head(20))
grouped = testing_df_AvgHrsWages_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgHrsWages_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# testing_df_AvgHrsWages_ByIndigenous = testing_df_AvgHrsWages.loc[
#     (testing_df_AvgHrsWages['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_AvgHrsWages['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_AvgHrsWages_ByIndigenous.head(20))
# grouped = testing_df_AvgHrsWages_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(testing_df_AvgHrsWages_ByIndigenous.index))


Age group
                           sum  size
Characteristics                     
15 to 24 years         3689.12   192
25 to 34 years         5759.41   210
35 to 44 years         6991.24   210
45 to 54 years         7451.95   210
55 to 64 years         7475.92   210
65 years old and over  6329.58   192
The total number of this one is  1224

Gender group
                      sum  size
Characteristics                
Female employees  6439.32   210
Male employees    7203.15   210
The total number of this one is  420

Education group
                                  sum  size
Characteristics                            
High school diploma and less  5242.44   210
Trade certificate             5569.30   198
University degree and higher  7296.08   198
The total number of this one is  606

Immigrant group
                             sum  size
Characteristics                       
Immigrant employees      6065.65   198
Non-immigrant employees  6267.15   198
The total number of this one is  396

Filter "Average weekly hours worked" by the following characteristics: "Age group", "Gender", "Education level", and "Immigration status".<br />
The "Aboriginal status" (Indigenous identity) filter has been commented out.

In [111]:
# Dataset by training dataset inside "Average weekly hours worked"

print("\nAge group")
training_df_AvgWeekHrsWrked_ByAge = training_df_AvgWeekHrsWrked.loc[
    (training_df_AvgWeekHrsWrked['Characteristics'] == '15 to 24 years') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == '25 to 34 years') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == '35 to 44 years') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == '45 to 54 years') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == '55 to 64 years') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == '65 years old and over')]
# print(training_df_AvgWeekHrsWrked_ByAge.head(20))
grouped = training_df_AvgWeekHrsWrked_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgWeekHrsWrked_ByAge.index))

print("\nGender group")
training_df_AvgWeekHrsWrked_ByGender = training_df_AvgWeekHrsWrked.loc[
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'Female employees') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'Male employees')
]
# print(training_df_AvgWeekHrsWrked_ByGender.head(20))
grouped = training_df_AvgWeekHrsWrked_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgWeekHrsWrked_ByGender.index))

print("\nEducation group")
training_df_AvgWeekHrsWrked_ByEducation = training_df_AvgWeekHrsWrked.loc[
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'High school diploma and less') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'Trade certificate') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'University degree and higher')
]
# print(training_df_AvgWeekHrsWrked_ByEducation.head(20))
grouped = training_df_AvgWeekHrsWrked_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgWeekHrsWrked_ByEducation.index))

print("\nImmigrant group")
training_df_AvgWeekHrsWrked_ByImmigrant = training_df_AvgWeekHrsWrked.loc[
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'Immigrant employees') |
    (training_df_AvgWeekHrsWrked['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_AvgWeekHrsWrked_ByImmigrant.head(20))
grouped = training_df_AvgWeekHrsWrked_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_AvgWeekHrsWrked_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# training_df_AvgWeekHrsWrked_ByIndigenous = training_df_AvgWeekHrsWrked.loc[
#     (training_df_AvgWeekHrsWrked['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_AvgWeekHrsWrked['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_AvgWeekHrsWrked_ByIndigenous.head(20))
# grouped = training_df_AvgWeekHrsWrked_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(training_df_AvgWeekHrsWrked_ByIndigenous.index))


Age group
                           sum  size
Characteristics                     
15 to 24 years          6701.0   384
25 to 34 years         13045.0   420
35 to 44 years         14441.0   420
45 to 54 years         14869.0   420
55 to 64 years         13528.0   420
65 years old and over   7948.0   384
The total number of this one is  2448

Gender group
                      sum  size
Characteristics                
Female employees  12380.0   420
Male employees    13436.0   420
The total number of this one is  840

Education group
                                  sum  size
Characteristics                            
High school diploma and less  10649.0   420
Trade certificate             11956.0   396
University degree and higher  13285.0   396
The total number of this one is  1212

Immigrant group
                             sum  size
Characteristics                       
Immigrant employees      12463.0   396
Non-immigrant employees  11969.0   396
The total number of this one is  792
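
One possible sanity check on these splits: annual hours should be roughly 52 times the weekly hours for the same group. The sketch below compares the training-set age-group sums printed for "Average annual hours worked" and "Average weekly hours worked". It is only a rough check under the assumption that both indicators cover the same rows (the `size` columns match); the variable names are chosen for the illustration.

```python
import pandas as pd

age_labels = [
    "15 to 24 years", "25 to 34 years", "35 to 44 years",
    "45 to 54 years", "55 to 64 years", "65 years old and over",
]

# Training-set sums copied from the two age-group outputs.
annual_hours_sum = pd.Series(
    [348240.0, 677949.0, 750842.0, 773872.0, 703129.0, 413498.0],
    index=age_labels,
)
weekly_hours_sum = pd.Series(
    [6701.0, 13045.0, 14441.0, 14869.0, 13528.0, 7948.0],
    index=age_labels,
)

# Both sums cover the same number of rows per group, so the ratio of
# sums equals the ratio of means: implied weeks worked per year.
weeks_per_year = annual_hours_sum / weekly_hours_sum
print(weeks_per_year.round(2))
```

A ratio far from 52 for some group would suggest that the two indicator frames were filtered differently.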

In [112]:
# Dataset by testing dataset inside "Average weekly hours worked"

print("\nAge group")
testing_df_AvgWeekHrsWrked_ByAge = testing_df_AvgWeekHrsWrked.loc[
    (testing_df_AvgWeekHrsWrked['Characteristics'] == '15 to 24 years') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == '25 to 34 years') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == '35 to 44 years') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == '45 to 54 years') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == '55 to 64 years') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == '65 years old and over')]
# print(testing_df_AvgWeekHrsWrked_ByAge.head(20))
grouped = testing_df_AvgWeekHrsWrked_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgWeekHrsWrked_ByAge.index))

print("\nGender group")
testing_df_AvgWeekHrsWrked_ByGender = testing_df_AvgWeekHrsWrked.loc[
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Female employees') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Male employees')
]
# print(testing_df_AvgWeekHrsWrked_ByGender.head(20))
grouped = testing_df_AvgWeekHrsWrked_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgWeekHrsWrked_ByGender.index))

print("\nEducation group")
testing_df_AvgWeekHrsWrked_ByEducation = testing_df_AvgWeekHrsWrked.loc[
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'High school diploma and less') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Trade certificate') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'University degree and higher')
]
# print(testing_df_AvgWeekHrsWrked_ByEducation.head(20))
grouped = testing_df_AvgWeekHrsWrked_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgWeekHrsWrked_ByEducation.index))

print("\nImmigrant group")
testing_df_AvgWeekHrsWrked_ByImmigrant = testing_df_AvgWeekHrsWrked.loc[
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Immigrant employees') |
    (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_AvgWeekHrsWrked_ByImmigrant.head(20))
grouped = testing_df_AvgWeekHrsWrked_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_AvgWeekHrsWrked_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# testing_df_AvgWeekHrsWrked_ByIndigenous = testing_df_AvgWeekHrsWrked.loc[
#     (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_AvgWeekHrsWrked['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_AvgWeekHrsWrked_ByIndigenous.head(20))
# grouped = testing_df_AvgWeekHrsWrked_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(testing_df_AvgWeekHrsWrked_ByIndigenous.index))


Age group
                          sum  size
Characteristics                    
15 to 24 years         3465.0   192
25 to 34 years         6413.0   210
35 to 44 years         7139.0   210
45 to 54 years         7406.0   210
55 to 64 years         6765.0   210
65 years old and over  3969.0   192
The total number of this one is  1224

Gender group
                     sum  size
Characteristics               
Female employees  6219.0   210
Male employees    6590.0   210
The total number of this one is  420

Education group
                                 sum  size
Characteristics                           
High school diploma and less  5303.0   210
Trade certificate             5889.0   198
University degree and higher  6568.0   198
The total number of this one is  606

Immigrant group
                            sum  size
Characteristics                      
Immigrant employees      6133.0   198
Non-immigrant employees  5961.0   198
The total number of this one is  396


Filter "Hours worked" by the following characteristics: "Age group", "Gender", "Education level", and "Immigration status".<br />
The "Aboriginal status" (Indigenous identity) filter has been commented out.


In [113]:
# Dataset by training dataset inside "Hours Worked"

print("\nAge group in Alberta")
training_df_Hrs_Wrked_ByAge = training_df_Hrs_Wrked.loc[
    (training_df_Hrs_Wrked['Characteristics'] == '15 to 24 years') |
    (training_df_Hrs_Wrked['Characteristics'] == '25 to 34 years') |
    (training_df_Hrs_Wrked['Characteristics'] == '35 to 44 years') |
    (training_df_Hrs_Wrked['Characteristics'] == '45 to 54 years') |
    (training_df_Hrs_Wrked['Characteristics'] == '55 to 64 years') |
    (training_df_Hrs_Wrked['Characteristics'] == '65 years old and over')]
# print(training_df_Hrs_Wrked_ByAge.head(20))
grouped = training_df_Hrs_Wrked_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_Hrs_Wrked_ByAge.index))

print("\nGender group in Alberta")
training_df_Hrs_Wrked_ByGender = training_df_Hrs_Wrked.loc[
    (training_df_Hrs_Wrked['Characteristics'] == 'Female employees') |
    (training_df_Hrs_Wrked['Characteristics'] == 'Male employees')
]
# print(training_df_Hrs_Wrked_ByGender.head(20))
grouped = training_df_Hrs_Wrked_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_Hrs_Wrked_ByGender.index))

print("\nEducation group in Alberta")
training_df_Hrs_Wrked_ByEducation = training_df_Hrs_Wrked.loc[
    (training_df_Hrs_Wrked['Characteristics'] == 'High school diploma and less') |
    (training_df_Hrs_Wrked['Characteristics'] == 'Trade certificate') |
    (training_df_Hrs_Wrked['Characteristics'] == 'University degree and higher')
]
# print(training_df_Hrs_Wrked_ByEducation.head(20))
grouped = training_df_Hrs_Wrked_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_Hrs_Wrked_ByEducation.index))

print("\nImmigrant group in Alberta")
training_df_Hrs_Wrked_ByImmigrant = training_df_Hrs_Wrked.loc[
    (training_df_Hrs_Wrked['Characteristics'] == 'Immigrant employees') |
    (training_df_Hrs_Wrked['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_Hrs_Wrked_ByImmigrant.head(20))
grouped = training_df_Hrs_Wrked_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_Hrs_Wrked_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# training_df_Hrs_Wrked_ByIndigenous = training_df_Hrs_Wrked.loc[
#     (training_df_Hrs_Wrked['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_Hrs_Wrked['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_Hrs_Wrked_ByIndigenous.head(20))
# grouped = training_df_Hrs_Wrked_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(training_df_Hrs_Wrked_ByIndigenous.index))


Age group in Alberta
                              sum  size
Characteristics                        
15 to 24 years          6319354.0   384
25 to 34 years         21892318.0   420
35 to 44 years         23526930.0   420
45 to 54 years         26229128.0   420
55 to 64 years         20321678.0   420
65 years old and over   4298529.0   384
The total number of this one is  2448

Gender group in Alberta
                         sum  size
Characteristics                   
Female employees  69941334.0   420
Male employees    32650634.0   420
The total number of this one is  840

Education group in Alberta
                                     sum  size
Characteristics                               
High school diploma and less  18744590.0   420
Trade certificate              7294759.0   396
University degree and higher  49076023.0   396
The total number of this one is  1212

Immigrant group in Alberta
                                sum  size
Characteristics                          
Immig
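
The cell above repeats the same chain of `==` comparisons for every group, and the later cells repeat it again for each indicator and split. A minimal sketch of the same selection using `Series.isin` is given here, assuming only the `Characteristics` and `VALUE` columns seen in the outputs. The names `GROUPS` and `split_by_group` and the demo frame are hypothetical and are not part of the notebook. String aggregation names (`"sum"`, `"size"`) are used in place of `np.sum` and `np.size`, which should give the same two columns.

```python
import pandas as pd

# Hypothetical mapping: group name -> labels in the 'Characteristics' column.
GROUPS = {
    "Age": ["15 to 24 years", "25 to 34 years", "35 to 44 years",
            "45 to 54 years", "55 to 64 years", "65 years old and over"],
    "Gender": ["Female employees", "Male employees"],
    "Education": ["High school diploma and less", "Trade certificate",
                  "University degree and higher"],
    "Immigrant": ["Immigrant employees", "Non-immigrant employees"],
}


def split_by_group(df, groups=GROUPS):
    """Return {group name: rows whose 'Characteristics' belongs to that group}."""
    return {name: df.loc[df["Characteristics"].isin(labels)]
            for name, labels in groups.items()}


# Made-up frame, only to show the call; the values are not real data.
demo = pd.DataFrame({
    "Characteristics": ["15 to 24 years", "Female employees", "Male employees",
                        "Trade certificate", "Immigrant employees",
                        "Indigenous identity employees"],
    "VALUE": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

parts = split_by_group(demo)
for name, part in parts.items():
    print(f"\n{name} group")
    print(part.groupby("Characteristics")["VALUE"].agg(["sum", "size"]))
    print("The total number of this one is ", len(part.index))
```

Labels that appear in no group, such as the Indigenous identity rows in the demo frame, are simply left out, which mirrors the commented-out block in the cell.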

In [114]:
# Split the "Hours worked" testing dataset by characteristic group

print("\nAge group in Alberta")
testing_df_Hrs_Wrked_ByAge = testing_df_Hrs_Wrked.loc[
    (testing_df_Hrs_Wrked['Characteristics'] == '15 to 24 years') |
    (testing_df_Hrs_Wrked['Characteristics'] == '25 to 34 years') |
    (testing_df_Hrs_Wrked['Characteristics'] == '35 to 44 years') |
    (testing_df_Hrs_Wrked['Characteristics'] == '45 to 54 years') |
    (testing_df_Hrs_Wrked['Characteristics'] == '55 to 64 years') |
    (testing_df_Hrs_Wrked['Characteristics'] == '65 years old and over')]
# print(testing_df_Hrs_Wrked_ByAge.head(20))
grouped = testing_df_Hrs_Wrked_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_Hrs_Wrked_ByAge.index))

print("\nGender group in Alberta")
testing_df_Hrs_Wrked_ByGender = testing_df_Hrs_Wrked.loc[
    (testing_df_Hrs_Wrked['Characteristics'] == 'Female employees') |
    (testing_df_Hrs_Wrked['Characteristics'] == 'Male employees')
]
# print(testing_df_Hrs_Wrked_ByGender.head(20))
grouped = testing_df_Hrs_Wrked_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_Hrs_Wrked_ByGender.index))

print("\nEducation group in Alberta")
testing_df_Hrs_Wrked_ByEducation = testing_df_Hrs_Wrked.loc[
    (testing_df_Hrs_Wrked['Characteristics'] == 'High school diploma and less') |
    (testing_df_Hrs_Wrked['Characteristics'] == 'Trade certificate') |
    (testing_df_Hrs_Wrked['Characteristics'] == 'University degree and higher')
]
# print(testing_df_Hrs_Wrked_ByEducation.head(20))
grouped = testing_df_Hrs_Wrked_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_Hrs_Wrked_ByEducation.index))

print("\nImmigrant group in Alberta")
testing_df_Hrs_Wrked_ByImmigrant = testing_df_Hrs_Wrked.loc[
    (testing_df_Hrs_Wrked['Characteristics'] == 'Immigrant employees') |
    (testing_df_Hrs_Wrked['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_Hrs_Wrked_ByImmigrant.head(20))
grouped = testing_df_Hrs_Wrked_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_Hrs_Wrked_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# testing_df_Hrs_Wrked_ByIndigenous = testing_df_Hrs_Wrked.loc[
#     (testing_df_Hrs_Wrked['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_Hrs_Wrked['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_Hrs_Wrked_ByIndigenous.head(20))
# grouped = testing_df_Hrs_Wrked_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(testing_df_Hrs_Wrked_ByIndigenous.index))


Age group in Alberta
                              sum  size
Characteristics                        
15 to 24 years          3132171.0   192
25 to 34 years         11857576.0   210
35 to 44 years         12919606.0   210
45 to 54 years         12974580.0   210
55 to 64 years         10690620.0   210
65 years old and over   2528517.0   192
The total number of this one is  1224

Gender group in Alberta
                         sum  size
Characteristics                   
Female employees  37193214.0   210
Male employees    16912308.0   210
The total number of this one is  420

Education group in Alberta
                                     sum  size
Characteristics                               
High school diploma and less   9285900.0   210
Trade certificate              3449203.0   198
University degree and higher  27393105.0   198
The total number of this one is  606

Immigrant group in Alberta
                                sum  size
Characteristics                          
Immigr

The "Number of jobs" data is filtered by the following characteristics: "Age group", "Gender level", "Education level", and "Immigration status".<br />
"Aboriginal status" has been commented out.

In [115]:
# Split the "Number of jobs" training dataset by characteristic group

print("\nAge group in Alberta")
training_df_NumOfJob_ByAge = training_df_NumOfJob.loc[
    (training_df_NumOfJob['Characteristics'] == '15 to 24 years') |
    (training_df_NumOfJob['Characteristics'] == '25 to 34 years') |
    (training_df_NumOfJob['Characteristics'] == '35 to 44 years') |
    (training_df_NumOfJob['Characteristics'] == '45 to 54 years') |
    (training_df_NumOfJob['Characteristics'] == '55 to 64 years') |
    (training_df_NumOfJob['Characteristics'] == '65 years old and over')]
# print(training_df_NumOfJob_ByAge.head(20))
grouped = training_df_NumOfJob_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_NumOfJob_ByAge.index))

print("\nGender group in Alberta")
training_df_NumOfJob_ByGender = training_df_NumOfJob.loc[
    (training_df_NumOfJob['Characteristics'] == 'Female employees') |
    (training_df_NumOfJob['Characteristics'] == 'Male employees')
]
# print(training_df_NumOfJob_ByGender.head(20))
grouped = training_df_NumOfJob_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_NumOfJob_ByGender.index))

print("\nEducation group in Alberta")
training_df_NumOfJob_ByEducation = training_df_NumOfJob.loc[
    (training_df_NumOfJob['Characteristics'] == 'High school diploma and less') |
    (training_df_NumOfJob['Characteristics'] == 'Trade certificate') |
    (training_df_NumOfJob['Characteristics'] == 'University degree and higher')
]
# print(training_df_NumOfJob_ByEducation.head(20))
grouped = training_df_NumOfJob_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_NumOfJob_ByEducation.index))

print("\nImmigrant group in Alberta")
training_df_NumOfJob_ByImmigrant = training_df_NumOfJob.loc[
    (training_df_NumOfJob['Characteristics'] == 'Immigrant employees') |
    (training_df_NumOfJob['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_NumOfJob_ByImmigrant.head(20))
grouped = training_df_NumOfJob_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_NumOfJob_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# training_df_NumOfJob_ByIndigenous = training_df_NumOfJob.loc[
#     (training_df_NumOfJob['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_NumOfJob['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_NumOfJob_ByIndigenous.head(20))
# grouped = training_df_NumOfJob_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(training_df_NumOfJob_ByIndigenous.index))


Age group in Alberta
                              sum  size
Characteristics                        
15 to 24 years          7043629.0   384
25 to 34 years         13985914.0   420
35 to 44 years         13654580.0   420
45 to 54 years         14554094.0   420
55 to 64 years         12313508.0   420
65 years old and over   4150652.0   384
The total number of this one is  2448

Gender group in Alberta
                         sum  size
Characteristics                   
Female employees  45510332.0   420
Male employees    20196478.0   420
The total number of this one is  840

Education group in Alberta
                                     sum  size
Characteristics                               
High school diploma and less  14427614.0   420
Trade certificate              4730099.0   396
University degree and higher  29514630.0   396
The total number of this one is  1212

Immigrant group in Alberta
                                sum  size
Characteristics                          
Immig
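
Each block above prints a per-label `size` and then a total row count, so the two can be checked against each other. The sketch below does this for the training "Age group" sizes copied from the output above (384 + 420 + 420 + 420 + 420 + 384 = 2448) and lists the labels that have fewer rows than the largest group. The helper name `check_sizes` is hypothetical.

```python
import pandas as pd

# Row counts per label, copied from the training "Age group" output above.
sizes = pd.Series({
    "15 to 24 years": 384,
    "25 to 34 years": 420,
    "35 to 44 years": 420,
    "45 to 54 years": 420,
    "55 to 64 years": 420,
    "65 years old and over": 384,
})
reported_total = 2448  # value from the "The total number of this one is" line


def check_sizes(sizes, reported_total):
    """Hypothetical helper: verify the total, return labels with fewer rows."""
    if sizes.sum() != reported_total:
        raise ValueError(
            f"sizes add up to {sizes.sum()}, expected {reported_total}")
    return sizes[sizes < sizes.max()]


short = check_sizes(sizes, reported_total)
print(short)
```

The two labels returned have 36 fewer rows each than the other age groups. One possible cause is the removal of NA rows in step 2, but nothing in this section confirms that.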

In [116]:
# Split the "Number of jobs" testing dataset by characteristic group

print("\nAge group in Alberta")
testing_df_NumOfJob_ByAge = testing_df_NumOfJob.loc[
    (testing_df_NumOfJob['Characteristics'] == '15 to 24 years') |
    (testing_df_NumOfJob['Characteristics'] == '25 to 34 years') |
    (testing_df_NumOfJob['Characteristics'] == '35 to 44 years') |
    (testing_df_NumOfJob['Characteristics'] == '45 to 54 years') |
    (testing_df_NumOfJob['Characteristics'] == '55 to 64 years') |
    (testing_df_NumOfJob['Characteristics'] == '65 years old and over')]
# print(testing_df_NumOfJob_ByAge.head(20))
grouped = testing_df_NumOfJob_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_NumOfJob_ByAge.index))

print("\nGender group in Alberta")
testing_df_NumOfJob_ByGender = testing_df_NumOfJob.loc[
    (testing_df_NumOfJob['Characteristics'] == 'Female employees') |
    (testing_df_NumOfJob['Characteristics'] == 'Male employees')
]
# print(testing_df_NumOfJob_ByGender.head(20))
grouped = testing_df_NumOfJob_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_NumOfJob_ByGender.index))

print("\nEducation group in Alberta")
testing_df_NumOfJob_ByEducation = testing_df_NumOfJob.loc[
    (testing_df_NumOfJob['Characteristics'] == 'High school diploma and less') |
    (testing_df_NumOfJob['Characteristics'] == 'Trade certificate') |
    (testing_df_NumOfJob['Characteristics'] == 'University degree and higher')
]
# print(testing_df_NumOfJob_ByEducation.head(20))
grouped = testing_df_NumOfJob_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_NumOfJob_ByEducation.index))

print("\nImmigrant group in Alberta")
testing_df_NumOfJob_ByImmigrant = testing_df_NumOfJob.loc[
    (testing_df_NumOfJob['Characteristics'] == 'Immigrant employees') |
    (testing_df_NumOfJob['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_NumOfJob_ByImmigrant.head(20))
grouped = testing_df_NumOfJob_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_NumOfJob_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# testing_df_NumOfJob_ByIndigenous = testing_df_NumOfJob.loc[
#     (testing_df_NumOfJob['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_NumOfJob['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_NumOfJob_ByIndigenous.head(20))
# grouped = testing_df_NumOfJob_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(testing_df_NumOfJob_ByIndigenous.index))


Age group in Alberta
                             sum  size
Characteristics                       
15 to 24 years         3414206.0   192
25 to 34 years         7615612.0   210
35 to 44 years         7559436.0   210
45 to 54 years         7201656.0   210
55 to 64 years         6451160.0   210
65 years old and over  2416851.0   192
The total number of this one is  1224

Gender group in Alberta
                         sum  size
Characteristics                   
Female employees  24112770.0   210
Male employees    10548592.0   210
The total number of this one is  420

Education group in Alberta
                                     sum  size
Characteristics                               
High school diploma and less   7204888.0   210
Trade certificate              2245738.0   198
University degree and higher  16513042.0   198
The total number of this one is  606

Immigrant group in Alberta
                                sum  size
Characteristics                          
Immigrant empl

The "Wages and Salaries" data is filtered by the following characteristics: "Age group", "Gender level", "Education level", and "Immigration status".<br />
"Aboriginal status" has been commented out.

In [117]:
# Split the "Wages and Salaries" training dataset by characteristic group

print("\nAge group in Alberta")
training_df_WagesAndSalaries_ByAge = training_df_WagesAndSalaries.loc[
    (training_df_WagesAndSalaries['Characteristics'] == '15 to 24 years') |
    (training_df_WagesAndSalaries['Characteristics'] == '25 to 34 years') |
    (training_df_WagesAndSalaries['Characteristics'] == '35 to 44 years') |
    (training_df_WagesAndSalaries['Characteristics'] == '45 to 54 years') |
    (training_df_WagesAndSalaries['Characteristics'] == '55 to 64 years') |
    (training_df_WagesAndSalaries['Characteristics'] == '65 years old and over')]
# print(training_df_WagesAndSalaries_ByAge.head(20))
grouped = training_df_WagesAndSalaries_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_WagesAndSalaries_ByAge.index))

print("\nGender group in Alberta")
training_df_WagesAndSalaries_ByGender = training_df_WagesAndSalaries.loc[
    (training_df_WagesAndSalaries['Characteristics'] == 'Female employees') |
    (training_df_WagesAndSalaries['Characteristics'] == 'Male employees')
]
# print(training_df_WagesAndSalaries_ByGender.head(20))
grouped = training_df_WagesAndSalaries_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_WagesAndSalaries_ByGender.index))

print("\nEducation group in Alberta")
training_df_WagesAndSalaries_ByEducation = training_df_WagesAndSalaries.loc[
    (training_df_WagesAndSalaries['Characteristics'] == 'High school diploma and less') |
    (training_df_WagesAndSalaries['Characteristics'] == 'Trade certificate') |
    (training_df_WagesAndSalaries['Characteristics'] == 'University degree and higher')
]
# print(training_df_WagesAndSalaries_ByEducation.head(20))
grouped = training_df_WagesAndSalaries_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_WagesAndSalaries_ByEducation.index))

print("\nImmigrant group in Alberta")
training_df_WagesAndSalaries_ByImmigrant = training_df_WagesAndSalaries.loc[
    (training_df_WagesAndSalaries['Characteristics'] == 'Immigrant employees') |
    (training_df_WagesAndSalaries['Characteristics'] == 'Non-immigrant employees')
]
# print(training_df_WagesAndSalaries_ByImmigrant.head(20))
grouped = training_df_WagesAndSalaries_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(training_df_WagesAndSalaries_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# training_df_WagesAndSalaries_ByIndigenous = training_df_WagesAndSalaries.loc[
#     (training_df_WagesAndSalaries['Characteristics'] == 'Indigenous identity employees') |
#     (training_df_WagesAndSalaries['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(training_df_WagesAndSalaries_ByIndigenous.head(20))
# grouped = training_df_WagesAndSalaries_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(training_df_WagesAndSalaries_ByIndigenous.index))


Age group in Alberta
                            sum  size
Characteristics                      
15 to 24 years         108973.0   384
25 to 34 years         538547.0   420
35 to 44 years         710804.0   420
45 to 54 years         842807.0   420
55 to 64 years         657465.0   420
65 years old and over  138783.0   384
The total number of this one is  2448

Gender group in Alberta
                        sum  size
Characteristics                  
Female employees  1953878.0   420
Male employees    1043622.0   420
The total number of this one is  840

Education group in Alberta
                                    sum  size
Characteristics                              
High school diploma and less   399426.0   420
Trade certificate              173261.0   396
University degree and higher  1690370.0   396
The total number of this one is  1212

Immigrant group in Alberta
                               sum  size
Characteristics                         
Immigrant employees       749325

In [118]:
# Split the "Wages and Salaries" testing dataset by characteristic group

print("\nAge group in Alberta")
testing_df_WagesAndSalaries_ByAge = testing_df_WagesAndSalaries.loc[
    (testing_df_WagesAndSalaries['Characteristics'] == '15 to 24 years') |
    (testing_df_WagesAndSalaries['Characteristics'] == '25 to 34 years') |
    (testing_df_WagesAndSalaries['Characteristics'] == '35 to 44 years') |
    (testing_df_WagesAndSalaries['Characteristics'] == '45 to 54 years') |
    (testing_df_WagesAndSalaries['Characteristics'] == '55 to 64 years') |
    (testing_df_WagesAndSalaries['Characteristics'] == '65 years old and over')]
# print(testing_df_WagesAndSalaries_ByAge.head(20))
grouped = testing_df_WagesAndSalaries_ByAge.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_WagesAndSalaries_ByAge.index))

print("\nGender group in Alberta")
testing_df_WagesAndSalaries_ByGender = testing_df_WagesAndSalaries.loc[
    (testing_df_WagesAndSalaries['Characteristics'] == 'Female employees') |
    (testing_df_WagesAndSalaries['Characteristics'] == 'Male employees')
]
# print(testing_df_WagesAndSalaries_ByGender.head(20))
grouped = testing_df_WagesAndSalaries_ByGender.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_WagesAndSalaries_ByGender.index))

print("\nEducation group in Alberta")
testing_df_WagesAndSalaries_ByEducation = testing_df_WagesAndSalaries.loc[
    (testing_df_WagesAndSalaries['Characteristics'] == 'High school diploma and less') |
    (testing_df_WagesAndSalaries['Characteristics'] == 'Trade certificate') |
    (testing_df_WagesAndSalaries['Characteristics'] == 'University degree and higher')
]
# print(testing_df_WagesAndSalaries_ByEducation.head(20))
grouped = testing_df_WagesAndSalaries_ByEducation.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_WagesAndSalaries_ByEducation.index))

print("\nImmigrant group in Alberta")
testing_df_WagesAndSalaries_ByImmigrant = testing_df_WagesAndSalaries.loc[
    (testing_df_WagesAndSalaries['Characteristics'] == 'Immigrant employees') |
    (testing_df_WagesAndSalaries['Characteristics'] == 'Non-immigrant employees')
]
# print(testing_df_WagesAndSalaries_ByImmigrant.head(20))
grouped = testing_df_WagesAndSalaries_ByImmigrant.groupby(['Characteristics'])
print(grouped['VALUE'].agg([np.sum, np.size]))
print("The total number of this one is ",len(testing_df_WagesAndSalaries_ByImmigrant.index))

# print("\nIndigenous group in Alberta")
# testing_df_WagesAndSalaries_ByIndigenous = testing_df_WagesAndSalaries.loc[
#     (testing_df_WagesAndSalaries['Characteristics'] == 'Indigenous identity employees') |
#     (testing_df_WagesAndSalaries['Characteristics'] == 'Non-indigenous identity employees')
# ]
# # print(testing_df_WagesAndSalaries_ByIndigenous.head(20))
# grouped = testing_df_WagesAndSalaries_ByIndigenous.groupby(['Characteristics'])
# print(grouped['VALUE'].agg([np.sum, np.size]))
# print("The total number of this one is ",len(testing_df_WagesAndSalaries_ByIndigenous.index))


Age group in Alberta
                            sum  size
Characteristics                      
15 to 24 years          63069.0   192
25 to 34 years         334200.0   210
35 to 44 years         441049.0   210
45 to 54 years         474202.0   210
55 to 64 years         386319.0   210
65 years old and over   90513.0   192
The total number of this one is  1224

Gender group in Alberta
                        sum  size
Characteristics                  
Female employees  1179670.0   210
Male employees     609748.0   210
The total number of this one is  420

Education group in Alberta
                                    sum  size
Characteristics                              
High school diploma and less   226638.0   210
Trade certificate               93374.0   198
University degree and higher  1052815.0   198
The total number of this one is  606

Immigrant group in Alberta
                               sum  size
Characteristics                         
Immigrant employees       473697.

<h3> Part 8 - Other unused indicators will be dropped </h3>

The next step produces the final output.

Final output for "Average annual hours worked".<br />
The training dataset is shown first, followed by the testing dataset.
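
`Target_To_Analysis` is defined earlier in the notebook and is not shown in this section. As a rough stand-in for the kind of summary that `print_result()` reports (per-label sum, mean, min, median, max and size, then overall figures), the sketch below uses plain pandas. The function name `summarize` and the demo values are hypothetical, and the standard deviation and skewness conventions used by the class may differ from the pandas defaults used here.

```python
import pandas as pd


def summarize(part, label):
    """Hypothetical helper: per-label and overall statistics of 'VALUE'."""
    per_label = part.groupby("Characteristics")["VALUE"].agg(
        ["sum", "mean", "min", "median", "max", "size"])
    values = part["VALUE"]
    overall = {
        "sum": values.sum(),
        "mean": values.mean(),
        "min": values.min(),
        "median": values.median(),
        "max": values.max(),
        "std": values.std(),    # pandas default: sample std (ddof=1)
        "skew": values.skew(),  # pandas default: bias-corrected skewness
        "size": len(values),
    }
    print(label)
    print(per_label)
    print("Overall,")
    for name, value in overall.items():
        print(f"{name} : {value}")
    return per_label, overall


# Made-up values, only to show the call; these are not rows from the dataset.
demo = pd.DataFrame({
    "Characteristics": ["Female employees"] * 3 + ["Male employees"] * 3,
    "VALUE": [1500.0, 1540.0, 1580.0, 1700.0, 1750.0, 1800.0],
})
per_label, overall = summarize(demo, "demo set By Gender")
```

The column names here are `min` and `max`. The notebook output shows `amin` and `amax`, which suggests that the class passes NumPy functions rather than strings to `agg`.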

In [119]:
dfa_Target_To_Analysis = [training_df_AvgAnnHrsWrk_ByAge, training_df_AvgAnnHrsWrk_ByGender,training_df_AvgAnnHrsWrk_ByEducation, training_df_AvgAnnHrsWrk_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                            sum         mean    amin  median    amax  size
Characteristics                                                           
15 to 24 years         348240.0   906.875000   462.0   916.0  1185.0   384
25 to 34 years         677949.0  1614.164286  1236.0  1604.5  1967.0   420
35 to 44 years         750842.0  1787.719048  1436.0  1760.5  2183.0   420
45 to 54 years         773872.0  1842.552381  1523.0  1829.0  2312.0   420
55 to 64 years         703129.0  1674.116667  1390.0  1679.0  2000.0   420
65 years old and over  413498.0  1076.817708   701.0  1073.5  1500.0   384
Overall,
Sum :  3667530.0
Mean :  1498.174019607843
Min/median/max : 462.0 / 1646.0 / 2312.0
Standard Deviation :  371.3648448428451
Skewnewss :  -0.5841946247182938
Total size :  2448

training set By Gender
                       sum         mean    amin  median    amax  size
Characteristics                                                      
Female employees  643890.0  153

In [120]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age
Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [121]:
dfa_Target_To_Analysis = [testing_df_AvgAnnHrsWrk_ByAge, testing_df_AvgAnnHrsWrk_ByGender, testing_df_AvgAnnHrsWrk_ByEducation, testing_df_AvgAnnHrsWrk_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                            sum         mean    amin  median    amax  size
Characteristics                                                           
15 to 24 years         179872.0   936.833333   713.0   927.5  1281.0   192
25 to 34 years         333662.0  1588.866667  1292.0  1576.0  1870.0   210
35 to 44 years         371386.0  1768.504762  1424.0  1757.0  2092.0   210
45 to 54 years         384987.0  1833.271429  1541.0  1826.5  2191.0   210
55 to 64 years         351843.0  1675.442857  1377.0  1675.0  2071.0   210
65 years old and over  206760.0  1076.875000   565.0  1076.5  1415.0   192
Overall,
Sum :  1828510.0
Mean :  1493.8807189542483
Min/median/max : 565.0 / 1633.0 / 2191.0
Standard Deviation :  357.1076307712267
Skewnewss :  -0.5829393561867995
Total size :  1224

testing set By Gender
                       sum         mean    amin  median    amax  size
Characteristics                                                      
Female employees  323755.0  1541

In [122]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age
Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Final output for "Average annual wages and salaries".<br />
The training dataset is shown first, followed by the testing dataset.

In [123]:
dfa_Target_To_Analysis = [training_df_AvgAnnWages_ByAge, training_df_AvgAnnWages_ByGender,training_df_AvgAnnWages_ByEducation, training_df_AvgAnnWages_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                              sum          mean     amin   median      amax  size
Characteristics
15 to 24 years          5844745.0  15220.690104   8769.0  14305.5   39963.0   384
25 to 34 years         16279492.0  38760.695238  23433.0  37717.0   69687.0   420
35 to 44 years         22063695.0  52532.607143  33680.0  51506.0   92783.0   420
45 to 54 years         23873488.0  56841.638095  30761.0  55972.0   94580.0   420
55 to 64 years         22698208.0  54043.352381  28716.0  51361.5  113680.0   420
65 years old and over  12199299.0  31769.007812  14168.0  29624.0   81488.0   384
Overall,
Sum :  102958927.0
Mean :  42058.3852124183
Min/median/max : 8769.0 / 41458.0 / 113680.0
Standard 

In [124]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age
Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [125]:
dfa_Target_To_Analysis = [testing_df_AvgAnnWages_ByAge, testing_df_AvgAnnWages_ByGender, testing_df_AvgAnnWages_ByEducation, testing_df_AvgAnnWages_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                              sum          mean     amin   median      amax  size
Characteristics
15 to 24 years          3482308.0  18137.020833  11093.0  16653.5   45844.0   192
25 to 34 years          9108041.0  43371.623810  26534.0  42901.0   75429.0   210
35 to 44 years         12296655.0  58555.500000  37336.0  56960.5   95714.0   210
45 to 54 years         13595377.0  64739.890476  39455.0  64274.5  103580.0   210
55 to 64 years         12592232.0  59963.009524  33296.0  56266.0  133071.0   210
65 years old and over   6840988.0  35630.145833  18187.0  34170.0   76577.0   192
Overall,
Sum :  57915601.0
Mean :  47316.66748366013
Min/median/max : 11093.0 / 47033.5 / 133071.0
Standard 

In [126]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age


Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Final output for "Average hourly wage".<br />
The training dataset is shown first, followed by the testing dataset.

In [127]:
dfa_Target_To_Analysis = [training_df_AvgHrsWages_ByAge, training_df_AvgHrsWages_ByGender,training_df_AvgHrsWages_ByEducation, training_df_AvgHrsWages_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                            sum       mean   amin  median   amax  size
Characteristics                                                       
15 to 24 years          6394.37  16.652005  11.70  15.830  33.72   384
25 to 34 years         10151.74  24.170810  15.07  23.265  47.75   420
35 to 44 years         12429.33  29.593643  17.22  29.065  54.92   420
45 to 54 years         13001.51  30.955976  16.89  30.625  56.39   420
55 to 64 years         13540.12  32.238381  16.73  30.490  63.53   420
65 years old and over  11253.83  29.306849  15.90  27.780  69.41   384
Overall,
Sum :  66770.9
Mean :  27.275694444444444
Min/median/max : 11.7 / 26.57 / 69.41
Standard Deviation :  9.049315261791325
Skewnewss :  0.9740700424146075
Total size :  2448

training set By Gender
                       sum       mean   amin  median   amax  size
Characteristics                                                  
Female employees  11253.54  26.794143  15.83  25.925  50.72   420
Male emplo

In [128]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age


Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [129]:
dfa_Target_To_Analysis = [testing_df_AvgHrsWages_ByAge, testing_df_AvgHrsWages_ByGender, testing_df_AvgHrsWages_ByEducation, testing_df_AvgHrsWages_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                           sum       mean   amin  median   amax  size
Characteristics                                                      
15 to 24 years         3689.12  19.214167  12.95  18.775  36.37   192
25 to 34 years         5759.41  27.425762  17.01  27.010  45.91   210
35 to 44 years         6991.24  33.291619  19.42  32.680  58.12   210
45 to 54 years         7451.95  35.485476  21.54  34.965  64.54   210
55 to 64 years         7475.92  35.599619  19.42  33.800  64.92   210
65 years old and over  6329.58  32.966563  18.32  32.140  66.63   192
Overall,
Sum :  37697.22
Mean :  30.79838235294118
Min/median/max : 12.95 / 30.425 / 66.63
Standard Deviation :  9.683878549205739
Skewnewss :  0.8175005207452904
Total size :  1224

testing set By Gender
                      sum       mean   amin  median   amax  size
Characteristics                                                 
Female employees  6439.32  30.663429  18.34  30.200  55.06   210
Male employees    720
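
The per-group tables and the "Overall" figures printed above can be reproduced with plain pandas calls. The sketch below is only an approximation of what `print_result` appears to compute (the `Target_To_Analysis` class is defined earlier in the notebook). The helper name `summarize` and the toy values are assumptions made for illustration, and the standard deviation here is the sample version, which may differ from the one used in the class.

```python
import pandas as pd

# Toy stand-in for one of the *_ByAge frames; the values are invented.
toy = pd.DataFrame({
    "Characteristics": ["15 to 24 years", "15 to 24 years",
                        "25 to 34 years", "25 to 34 years"],
    "VALUE": [12.0, 18.0, 20.0, 30.0],
})

def summarize(df, group_col="Characteristics", value_col="VALUE"):
    """Hypothetical helper: return (per-group table, overall statistics)."""
    # String names avoid the 'amin'/'amax' labels produced by np.min/np.max.
    per_group = df.groupby(group_col)[value_col].agg(
        ["sum", "mean", "min", "median", "max", "size"]
    )
    values = df[value_col]
    overall = {
        "sum": values.sum(),
        "mean": values.mean(),
        "min": values.min(),
        "median": values.median(),
        "max": values.max(),
        "std": values.std(),    # sample standard deviation (ddof=1)
        "skew": values.skew(),  # bias-corrected sample skewness
        "size": len(values),
    }
    return per_group, overall

per_group, overall = summarize(toy)
print(per_group)
print(overall)
```

A positive `skew` value, as in the wage tables above, indicates a longer right tail; a negative one, as in the weekly-hours tables, indicates a longer left tail.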

In [130]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age


Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Final output for "Average weekly hours worked".<br />
The training dataset is shown first, followed by the testing dataset.

In [131]:
dfa_Target_To_Analysis = [training_df_AvgWeekHrsWrked_ByAge, training_df_AvgWeekHrsWrked_ByGender, training_df_AvgWeekHrsWrked_ByEducation, training_df_AvgWeekHrsWrked_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                           sum       mean  amin  median  amax  size
Characteristics                                                    
15 to 24 years          6701.0  17.450521   9.0    18.0  23.0   384
25 to 34 years         13045.0  31.059524  24.0    31.0  38.0   420
35 to 44 years         14441.0  34.383333  28.0    34.0  42.0   420
45 to 54 years         14869.0  35.402381  29.0    35.0  44.0   420
55 to 64 years         13528.0  32.209524  27.0    32.0  38.0   420
65 years old and over   7948.0  20.697917  13.0    21.0  29.0   384
Overall,
Sum :  70532.0
Mean :  28.812091503267975
Min/median/max : 9.0 / 32.0 / 44.0
Standard Deviation :  7.142617025891534
Skewnewss :  -0.5867338647982026
Total size :  2448

In [132]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age


Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [133]:
dfa_Target_To_Analysis = [testing_df_AvgWeekHrsWrked_ByAge, testing_df_AvgWeekHrsWrked_ByGender, testing_df_AvgWeekHrsWrked_ByEducation, testing_df_AvgWeekHrsWrked_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                          sum       mean  amin  median  amax  size
Characteristics                                                   
15 to 24 years         3465.0  18.046875  14.0    18.0  25.0   192
25 to 34 years         6413.0  30.538095  25.0    30.0  36.0   210
35 to 44 years         7139.0  33.995238  27.0    34.0  40.0   210
45 to 54 years         7406.0  35.266667  30.0    35.0  42.0   210
55 to 64 years         6765.0  32.214286  26.0    32.0  40.0   210
65 years old and over  3969.0  20.671875  11.0    21.0  27.0   192
Overall,
Sum :  35157.0
Mean :  28.723039215686274
Min/median/max : 11.0 / 31.0 / 42.0
Standard Deviation :  6.87095753369461
Skewnewss :  -0.578814861266804
Total size :  1224

testing set By Gender
                     sum       mean  amin  median  amax  size
Characteristics                                              
Female employees  6219.0  29.614286  25.0    30.0  34.0   210
Male employees    6590.0  31.380952  26.0    32.0  35.0   2

In [134]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age


Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Final output for "Hours Worked".<br />
The training dataset is shown first, followed by the testing dataset.

In [135]:
dfa_Target_To_Analysis = [training_df_Hrs_Wrked_ByAge, training_df_Hrs_Wrked_ByGender,training_df_Hrs_Wrked_ByEducation, training_df_Hrs_Wrked_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                              sum          mean  amin   median      amax  size
Characteristics                                                               
15 to 24 years          6319354.0  16456.651042   6.0   4000.0  225988.0   384
25 to 34 years         21892318.0  52124.566667  32.0   9676.5  824205.0   420
35 to 44 years         23526930.0  56016.500000  32.0  10443.5  892149.0   420
45 to 54 years         26229128.0  62450.304762  29.0  10857.5  955066.0   420
55 to 64 years         20321678.0  48384.947619  20.0   8825.5  754767.0   420
65 years old and over   4298529.0  11194.085938  26.0   2495.5  168451.0   384
Overall,
Sum :  102587937.0
Mean :  41906.83700980392
Min/median/max : 6.0 / 6102.0 / 955066.0
Standard Deviation :  108499.46186114324
Skewnewss :  5.117834269815508
Total size :  2448

training set By Gender
                         sum           mean  amin   median       amax  size
Characteristics                                               

In [136]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age


Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [137]:
dfa_Target_To_Analysis = [testing_df_Hrs_Wrked_ByAge, testing_df_Hrs_Wrked_ByGender, testing_df_Hrs_Wrked_ByEducation, testing_df_Hrs_Wrked_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                              sum          mean  amin   median      amax  size
Characteristics                                                               
15 to 24 years          3132171.0  16313.390625  29.0   3859.5  228566.0   192
25 to 34 years         11857576.0  56464.647619  32.0  10166.5  888764.0   210
35 to 44 years         12919606.0  61521.933333  33.0  10825.0  971688.0   210
45 to 54 years         12974580.0  61783.714286  31.0  10600.5  966480.0   210
55 to 64 years         10690620.0  50907.714286  25.0   8934.0  794712.0   210
65 years old and over   2528517.0  13169.359375  30.0   2913.5  183461.0   192
Overall,
Sum :  54103070.0
Mean :  44201.8545751634
Min/median/max : 25.0 / 6420.5 / 971688.0
Standard Deviation :  114827.01044742283
Skewnewss :  5.111311073923015
Total size :  1224

testing set By Gender
                         sum           mean  amin   median       amax  size
Characteristics                                                  

In [138]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age


Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Final output for "Number of jobs".<br />
The training dataset is shown first, followed by the testing dataset.

In [139]:
dfa_Target_To_Analysis = [training_df_NumOfJob_ByAge, training_df_NumOfJob_ByGender,training_df_NumOfJob_ByEducation, training_df_NumOfJob_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                              sum          mean  amin  median      amax  size
Characteristics                                                              
15 to 24 years          7043629.0  18342.783854  13.0  4215.0  246749.0   384
25 to 34 years         13985914.0  33299.795238  17.0  5885.0  527094.0   420
35 to 44 years         13654580.0  32510.904762  18.0  5906.0  516879.0   420
45 to 54 years         14554094.0  34652.604762  14.0  5993.0  530812.0   420
55 to 64 years         12313508.0  29317.876190  11.0  5355.0  453209.0   420
65 years old and over   4150652.0  10808.989583  18.0  2306.5  162525.0   384
Overall,
Sum :  65702377.0
Mean :  26839.206290849674
Min/median/max : 11.0 / 4297.5 / 530812.0
Standard Deviation :  65598.85047255474
Skewnewss :  4.799583077489634
Total size :  2448

training set By Gender
                         sum           mean  amin   median       amax  size
Characteristics                                                       

In [140]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age


Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [141]:
dfa_Target_To_Analysis = [testing_df_NumOfJob_ByAge, testing_df_NumOfJob_ByGender, testing_df_NumOfJob_ByEducation, testing_df_NumOfJob_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                             sum          mean  amin  median      amax  size
Characteristics                                                             
15 to 24 years         3414206.0  17782.322917  24.0  4022.0  247960.0   192
25 to 34 years         7615612.0  36264.819048  19.0  6267.5  564788.0   210
35 to 44 years         7559436.0  35997.314286  18.0  6322.0  561685.0   210
45 to 54 years         7201656.0  34293.600000  16.0  5847.0  529125.0   210
55 to 64 years         6451160.0  30719.809524  13.0  5371.5  472689.0   210
65 years old and over  2416851.0  12587.765625  25.0  2729.0  173157.0   192
Overall,
Sum :  34658921.0
Mean :  28316.111928104576
Min/median/max : 13.0 / 4637.5 / 564788.0
Standard Deviation :  69539.28870799903
Skewnewss :  4.838307897340021
Total size :  1224

testing set By Gender
                         sum           mean  amin   median       amax  size
Characteristics                                                            
Fema

In [142]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age
Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Final output for "Wages and Salaries".<br />
The training dataset is shown first, followed by the testing dataset.

In [143]:
dfa_Target_To_Analysis = [training_df_WagesAndSalaries_ByAge, training_df_WagesAndSalaries_ByGender,training_df_WagesAndSalaries_ByEducation, training_df_WagesAndSalaries_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['training set By Age',
                                                                                      'training set By Gender',
                                                                                      'training set By Education',
                                                                                      'training set By Immigrant'])
dfa_Target_To_Analysis.print_result()

training set By Age
                            sum         mean  amin  median     amax  size
Characteristics                                                          
15 to 24 years         108973.0   283.783854   0.0    63.0   4250.0   384
25 to 34 years         538547.0  1282.254762   1.0   211.0  21911.0   420
35 to 44 years         710804.0  1692.390476   1.0   252.5  28840.0   420
45 to 54 years         842807.0  2006.683333   1.0   285.5  32326.0   420
55 to 64 years         657465.0  1565.392857   1.0   228.0  25923.0   420
65 years old and over  138783.0   361.414062   1.0    66.5   5823.0   384
Overall,
Sum :  2997379.0
Mean :  1224.4195261437908
Min/median/max : 0.0 / 140.0 / 32326.0
Standard Deviation :  3393.654254908312
Skewnewss :  5.409532570280809
Total size :  2448

training set By Gender
                        sum         mean  amin  median     amax  size
Characteristics                                                      
Female employees  1953878.0  4652.090476  

In [144]:
print("Histogram for training dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for training dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for training dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for training dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for training dataset by age


Histogram for training dataset by gender
Histogram for training dataset by education
Histogram for training dataset by immigrant


In [145]:
dfa_Target_To_Analysis = [testing_df_WagesAndSalaries_ByAge, testing_df_WagesAndSalaries_ByGender, testing_df_WagesAndSalaries_ByEducation, testing_df_WagesAndSalaries_ByImmigrant]
dfa_Target_To_Analysis = Target_To_Analysis(dfa_Target_To_Analysis, pd, np, pp, sns, ['testing set By Age',
                                                                                      'testing set By Gender',
                                                                                      'testing set By Education',
                                                                                      'testing set By Immigrant'])
dfa_Target_To_Analysis.print_result()

testing set By Age
                            sum         mean  amin  median     amax  size
Characteristics                                                          
15 to 24 years          63069.0   328.484375   1.0    69.0   4797.0   192
25 to 34 years         334200.0  1591.428571   1.0   236.5  25966.0   210
35 to 44 years         441049.0  2100.233333   2.0   314.5  34223.0   210
45 to 54 years         474202.0  2258.104762   1.0   317.0  36376.0   210
55 to 64 years         386319.0  1839.614286   1.0   260.5  29595.0   210
65 years old and over   90513.0   471.421875   1.0    88.0   6876.0   192
Overall,
Sum :  1789352.0
Mean :  1461.888888888889
Min/median/max : 1.0 / 160.5 / 36376.0
Standard Deviation :  4021.2164121982664
Skewnewss :  5.308031050696339
Total size :  1224

testing set By Gender
                        sum         mean  amin  median     amax  size
Characteristics                                                      
Female employees  1179670.0  5617.476190   3

In [146]:
print("Histogram for testing dataset by age")
dfa_Target_To_Analysis.print_histogram(0)

print("Histogram for testing dataset by gender")
dfa_Target_To_Analysis.print_histogram(1)

print("Histogram for testing dataset by education")
dfa_Target_To_Analysis.print_histogram(2)

print("Histogram for testing dataset by immigrant")
dfa_Target_To_Analysis.print_histogram(3)

Histogram for testing dataset by age


Histogram for testing dataset by gender
Histogram for testing dataset by education
Histogram for testing dataset by immigrant


Back up the previous results to CSV files, one file per indicator and split.

In [147]:
# Save the dataframe to a CSV file

training_df_AvgAnnHrsWrk.to_csv('Result_By_Characteristics/training_df_AvgAnnHrsWrk.csv', index=False) # Average annual hours worked
testing_df_AvgAnnHrsWrk.to_csv('Result_By_Characteristics/testing_df_AvgAnnHrsWrk.csv', index=False)

training_df_AvgAnnWages.to_csv('Result_By_Characteristics/training_df_AvgAnnWages.csv', index=False) # Average annual wages and salaries
testing_df_AvgAnnWages.to_csv('Result_By_Characteristics/testing_df_AvgAnnWages.csv', index=False)

training_df_AvgHrsWages.to_csv('Result_By_Characteristics/training_df_AvgHrsWages.csv', index=False) # Average hourly wage
testing_df_AvgHrsWages.to_csv('Result_By_Characteristics/testing_df_AvgHrsWages.csv', index=False)

training_df_AvgWeekHrsWrked.to_csv('Result_By_Characteristics/training_df_AvgWeekHrsWrked.csv', index=False) # Average weekly hours worked
testing_df_AvgWeekHrsWrked.to_csv('Result_By_Characteristics/testing_df_AvgWeekHrsWrked.csv', index=False)

training_df_Hrs_Wrked.to_csv('Result_By_Characteristics/training_df_Hrs_Wrked.csv', index=False) # Hours Worked
testing_df_Hrs_Wrked.to_csv('Result_By_Characteristics/testing_df_Hrs_Wrked.csv', index=False)

training_df_NumOfJob.to_csv('Result_By_Characteristics/training_df_NumOfJob.csv', index=False) # Number of jobs
testing_df_NumOfJob.to_csv('Result_By_Characteristics/testing_df_NumOfJob.csv', index=False)

training_df_WagesAndSalaries.to_csv('Result_By_Characteristics/training_df_WagesAndSalaries.csv', index=False) # Wages and Salaries
testing_df_WagesAndSalaries.to_csv('Result_By_Characteristics/testing_df_WagesAndSalaries.csv', index=False)
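
The fourteen `to_csv` calls above all follow the same pattern, so they could also be driven from a dictionary. The following is a minimal sketch under that assumption: `save_frames` and the two demo frames are hypothetical names, and a temporary directory stands in for `Result_By_Characteristics` so the example does not touch the real output folder.

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_frames(frames, out_dir):
    """Hypothetical helper: write each DataFrame to out_dir/<name>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    written = []
    for name, frame in frames.items():
        path = out_dir / f"{name}.csv"
        frame.to_csv(path, index=False)  # same options as the cell above
        written.append(path)
    return written

# Invented toy frames standing in for the real training/testing frames.
demo_frames = {
    "training_df_demo": pd.DataFrame({"REF_DATE": [2013, 2014], "VALUE": [1.0, 2.0]}),
    "testing_df_demo": pd.DataFrame({"REF_DATE": [2019], "VALUE": [3.0]}),
}
demo_dir = Path(tempfile.mkdtemp()) / "Result_By_Characteristics"
written_paths = save_frames(demo_frames, demo_dir)
print([p.name for p in written_paths])
```

In the notebook itself, the dictionary keys would be the existing variable names (for example `training_df_AvgAnnHrsWrk`), which keeps the file names identical to the ones written above.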

<h3> Part 9 - Divide dataset by provinces but use only five provinces </h3>

<b>First part, divide by provinces</b>

In [148]:
# https://www.educative.io/blog/one-hot-encoding
# https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
# https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/
# https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
# https://www.baeldung.com/cs/train-test-datasets-ratio


In [149]:
print("Final step: sort the data by province.")

# --                                   sum          mean           std  size
# -- GEO
# -- Alberta                     2193966.0   2031.450000   2695.836034  1080
# -- British Columbia            2401296.0   2223.422222   2804.925187  1080
# -- Canada                     18252439.0  16900.406481  22232.852533  1080
# -- Manitoba                     767802.0    710.927778    915.637659  1080
# -- New Brunswick                359320.0    332.703704    530.962762  1080
# -- Newfoundland and Labrador    315895.0    306.099806    482.634908  1032
# -- Northwest Territories         42804.0     41.476744     51.817046  1032
# -- Nova Scotia                  531805.0    492.412037    757.119411  1080
# -- Nunavut                       14235.0     15.208333     14.752372   936
# -- Ontario                     6601634.0   6112.624074   7594.433779  1080
# -- Prince Edward Island          77931.0     75.514535    121.297367  1032
# -- Quebec                      4271657.0   3955.237963   5580.294544  1080
# -- Saskatchewan                 650781.0    602.575000    876.896377  1080
# -- Yukon                         16914.0     18.070513     20.188135   936
# -- Total number of rows: 14688


Final step: sort the data by province.


Create a directory called "Result_By_Provinces" to hold the output files of the province split.

In [150]:
CreatedTheFile = toOrganizedOutputFiles('Result_By_Provinces')

Directory 'Result_By_Provinces' is ALREADY created
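
The helper `toOrganizedOutputFiles` is defined earlier in the notebook, so its body is not repeated here. For readers who only need the behaviour, a minimal sketch of an idempotent "create the directory if it is missing" step might look like the following. The function name `ensure_output_dir` and the "is created" message are assumptions; only the "ALREADY created" message is taken from the output above.

```python
import os
import tempfile

def ensure_output_dir(path):
    """Hypothetical helper: create `path` if needed.

    Returns True when this call created the directory, False when it existed.
    """
    already_there = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the directory exists
    if already_there:
        print(f"Directory '{path}' is ALREADY created")
    else:
        print(f"Directory '{path}' is created")
    return not already_there

# Use a temporary base folder so the example leaves the project untouched.
base = tempfile.mkdtemp()
target = os.path.join(base, "Result_By_Provinces")
first = ensure_output_dir(target)   # creates the directory
second = ensure_output_dir(target)  # finds it already present
```

Because `os.makedirs(..., exist_ok=True)` never fails on an existing directory, the cell can be rerun safely.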


As the final step, I classify the data by province; this is where the final results come from.<br />
For this step, I use a class with methods to avoid duplicated and repetitive code.<br />
To keep the analysis manageable, only the training set (2013-2018) and the testing set (2019-2021) are used. <br />
The code for the 2010-2012 data is commented out.

Main class for Province Analysis:

In [151]:
# https://www.w3schools.com/python/python_classes.asp
# https://www.w3schools.com/python/python_for_loops.asp
# https://www.educba.com/multidimensional-array-in-python/

class ProvinceAnalysis:

    # Province :
    # -- ['Alberta',  'British Columbia',    'Canada' , 'Manitoba' , 'New Brunswick' 
    # 'Newfoundland and Labrador', 'Northwest Territories' , 'Nova Scotia' , 'Nunavut'
    # 'Ontario' , 'Prince Edward Island', 'Quebec', 'Saskatchewan', 'Yukon']

    def __init__(self, df, pd, np, pp):
        self.df = df
        self.province = ['Alberta',  'British Columbia', 'Canada', 'Manitoba', 
                        'New Brunswick', 'Newfoundland and Labrador', 
                        'Northwest Territories' , 'Nova Scotia' , 'Nunavut',
                        'Ontario' , 'Prince Edward Island', 'Quebec', 
                        'Saskatchewan', 'Yukon'
                        ]
        self.indicator = ["Average annual hours worked",
                        "Average annual wages and salaries",
                        "Average hourly wage",
                        "Average weekly hours worked",
                        "Hours Worked",
                        "Number of jobs",
                        "Wages and Salaries"]
        self.characteristic = ["Age group", "Gender", "Education Level", "Immigrant status", "Aboriginal status"]
        self.year = ["2010",
                    "below 2015",
                    "above 2016",
                    "2013",
                    "2016",
                    "2019"]
        self.pd = pd
        self.np = np
        self.pp = pp
        self.df_ByProvince = []
        for x in self.province:
            df_sorted = df.loc[df['GEO'] == x]
            self.df_ByProvince.append(df_sorted)

    def outputProvince(self, province_id):
        print(self.province[province_id])

    def outputIndicator(self, indicator_id):
        print(self.indicator[indicator_id])

    def outputCharacteristic(self, cha_id):
        print(self.characteristic[cha_id])

    def outputYear(self, year_id):
        print(self.year[year_id])

    def outputAnalysis(self, province_id):
        print("\nGrab the dataset only in " + str(self.province[province_id]))
        grouped = self.df_ByProvince[province_id].groupby(['Characteristics'])
        print(grouped['VALUE'].agg([np.sum, np.mean, np.min, np.median, np.max, np.size]))
        print("")
        print("Overall,")
        print("Sum : ",np.sum(self.df_ByProvince[province_id]['VALUE']))
        print("Mean : ",np.mean(self.df_ByProvince[province_id]['VALUE']))
        print("Min/median/max :",np.min(self.df_ByProvince[province_id]['VALUE']),"/",
            np.median(self.df_ByProvince[province_id]['VALUE']),"/",
            np.max(self.df_ByProvince[province_id]['VALUE']))
        print("Skewness : ",self.df_ByProvince[province_id]['VALUE'].skew())
        print("Total size : ",len(self.df_ByProvince[province_id].index))

    def outputAnalysisSimple(self, province_id):
        print("\nGrab the dataset only in " + str(self.province[province_id]))
        grouped = self.df_ByProvince[province_id].groupby(['Characteristics'])
        print(grouped['VALUE'].agg([self.np.sum, self.np.mean, self.np.size]))

    def outputList(self, province_id, num):
        print("\nGrab the dataset only in " + str(self.province[province_id]))
        print(self.df_ByProvince[province_id].head(num))
        print(self.df_ByProvince[province_id].info())

    def outputPandaProfiling(self, province_id, indicator_id, type_id):

        fileName = str(self.indicator[indicator_id]) + " " + str(self.year[type_id]) + " in " + str(self.province[province_id]) + ".html"

        # Build the profiling report for the selected province without modifying any columns.
        report = ProfileReport(self.df_ByProvince[province_id])
        report_html = report.to_html()

        print("File will be saved as " + str(fileName))
        # Write mode replaces an older report; append mode would stack a second HTML document in the same file.
        with open(fileName, "w", encoding="utf-8") as f:
            f.write(report_html)

    def print_histogram(self, province_id):
        # Plot the province's DataFrame, not the province name.
        sns.displot(data=self.df_ByProvince[province_id], x="VALUE", kind="hist", bins=100, aspect=1.5)
        plt.show()

    def outputFiveProvinces(self, pro1, pro2, pro3, pro4, pro5):
        frames = [self.df_ByProvince[pro1], self.df_ByProvince[pro2], self.df_ByProvince[pro3], self.df_ByProvince[pro4], self.df_ByProvince[pro5]]
        result = pd.concat(frames)
        return result
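
The two core operations of the class, splitting a frame by `GEO` and recombining a handful of provinces, can also be written with `groupby` and `isin`. The sketch below uses an invented toy frame and a hypothetical helper `select_provinces`; it is an alternative formulation for illustration, not the method used in the class above.

```python
import pandas as pd

# Invented toy data with the same column names as the real dataset.
toy = pd.DataFrame({
    "GEO": ["Alberta", "Ontario", "Quebec", "Yukon", "Ontario", "Canada"],
    "Characteristics": ["15 to 24 years"] * 6,
    "VALUE": [10.0, 20.0, 30.0, 1.0, 25.0, 86.0],
})

# One sub-frame per GEO value, keyed by name instead of list position.
by_province = {geo: frame for geo, frame in toy.groupby("GEO")}

def select_provinces(df, names):
    """Hypothetical helper: keep rows whose GEO is in `names`."""
    return df[df["GEO"].isin(names)]

chosen = ["Alberta", "Ontario", "Quebec"]
subset = select_provinces(toy, chosen)
print(subset)
```

One difference from `outputFiveProvinces`: `isin` keeps the original row order, whereas concatenating per-province frames groups the rows province by province.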

Filter "Average annual hours worked" by province

In [152]:
# "Average annual hours worked", split by characteristic and then by province.

training_df_AvgAnnHrsWrk_ByAge_Provinces = ProvinceAnalysis(training_df_AvgAnnHrsWrk_ByAge, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByAge_Provinces = ProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByAge, pd, np, pp)

training_df_AvgAnnHrsWrk_ByGender_Provinces = ProvinceAnalysis(training_df_AvgAnnHrsWrk_ByGender, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByGender_Provinces = ProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByGender, pd, np, pp)

training_df_AvgAnnHrsWrk_ByEducation_Provinces = ProvinceAnalysis(training_df_AvgAnnHrsWrk_ByEducation, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByEducation_Provinces = ProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByEducation, pd, np, pp)

training_df_AvgAnnHrsWrk_ByImmigrant_Provinces = ProvinceAnalysis(training_df_AvgAnnHrsWrk_ByImmigrant, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByImmigrant, pd, np, pp)

# training_df_AvgAnnHrsWrk_ByIndigenous_Provinces = ProvinceAnalysis(training_df_AvgAnnHrsWrk_ByIndigenous, pd, np, pp)
# testing_df_AvgAnnHrsWrk_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByIndigenous, pd, np, pp)

Filter by province for "Average annual wages and salaries"

In [153]:
# Split the "Average annual wages and salaries" datasets by province.

training_df_AvgAnnWages_ByAge_Provinces = ProvinceAnalysis(training_df_AvgAnnWages_ByAge, pd, np, pp)
testing_df_AvgAnnWages_ByAge_Provinces = ProvinceAnalysis(testing_df_AvgAnnWages_ByAge, pd, np, pp)

training_df_AvgAnnWages_ByGender_Provinces = ProvinceAnalysis(training_df_AvgAnnWages_ByGender, pd, np, pp)
testing_df_AvgAnnWages_ByGender_Provinces = ProvinceAnalysis(testing_df_AvgAnnWages_ByGender, pd, np, pp)

training_df_AvgAnnWages_ByEducation_Provinces = ProvinceAnalysis(training_df_AvgAnnWages_ByEducation, pd, np, pp)
testing_df_AvgAnnWages_ByEducation_Provinces = ProvinceAnalysis(testing_df_AvgAnnWages_ByEducation, pd, np, pp)

training_df_AvgAnnWages_ByImmigrant_Provinces = ProvinceAnalysis(training_df_AvgAnnWages_ByImmigrant, pd, np, pp)
testing_df_AvgAnnWages_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_AvgAnnWages_ByImmigrant, pd, np, pp)

# training_df_AvgAnnWages_ByIndigenous_Provinces = ProvinceAnalysis(training_df_AvgAnnWages_ByIndigenous, pd, np, pp)
# testing_df_AvgAnnWages_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_AvgAnnWages_ByIndigenous, pd, np, pp)

Filter by province for "Average hourly wage"

In [154]:
# Split the "Average hourly wage" datasets by province.

training_df_AvgHrsWages_ByAge_Provinces = ProvinceAnalysis(training_df_AvgHrsWages_ByAge, pd, np, pp)
testing_df_AvgHrsWages_ByAge_Provinces = ProvinceAnalysis(testing_df_AvgHrsWages_ByAge, pd, np, pp)

training_df_AvgHrsWages_ByGender_Provinces = ProvinceAnalysis(training_df_AvgHrsWages_ByGender, pd, np, pp)
testing_df_AvgHrsWages_ByGender_Provinces = ProvinceAnalysis(testing_df_AvgHrsWages_ByGender, pd, np, pp)

training_df_AvgHrsWages_ByEducation_Provinces = ProvinceAnalysis(training_df_AvgHrsWages_ByEducation, pd, np, pp)
testing_df_AvgHrsWages_ByEducation_Provinces = ProvinceAnalysis(testing_df_AvgHrsWages_ByEducation, pd, np, pp)

training_df_AvgHrsWages_ByImmigrant_Provinces = ProvinceAnalysis(training_df_AvgHrsWages_ByImmigrant, pd, np, pp)
testing_df_AvgHrsWages_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_AvgHrsWages_ByImmigrant, pd, np, pp)

# training_df_AvgHrsWages_ByIndigenous_Provinces = ProvinceAnalysis(training_df_AvgHrsWages_ByIndigenous, pd, np, pp)
# testing_df_AvgHrsWages_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_AvgHrsWages_ByIndigenous, pd, np, pp)

Filter by province for "Average weekly hours worked"

In [155]:
# Split the "Average weekly hours worked" datasets by province.

training_df_AvgWeekHrsWrked_ByAge_Provinces = ProvinceAnalysis(training_df_AvgWeekHrsWrked_ByAge, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByAge_Provinces = ProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByAge, pd, np, pp)

training_df_AvgWeekHrsWrked_ByGender_Provinces = ProvinceAnalysis(training_df_AvgWeekHrsWrked_ByGender, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByGender_Provinces = ProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByGender, pd, np, pp)

training_df_AvgWeekHrsWrked_ByEducation_Provinces = ProvinceAnalysis(training_df_AvgWeekHrsWrked_ByEducation, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByEducation_Provinces = ProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByEducation, pd, np, pp)

training_df_AvgWeekHrsWrked_ByImmigrant_Provinces = ProvinceAnalysis(training_df_AvgWeekHrsWrked_ByImmigrant, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByImmigrant, pd, np, pp)

# training_df_AvgWeekHrsWrked_ByIndigenous_Provinces = ProvinceAnalysis(training_df_AvgWeekHrsWrked_ByIndigenous, pd, np, pp)
# testing_df_AvgWeekHrsWrked_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByIndigenous, pd, np, pp)

Filter by province for "Hours worked"

In [156]:
# Split the "Hours worked" datasets by province.

training_df_Hrs_Wrked_ByAge_Provinces = ProvinceAnalysis(training_df_Hrs_Wrked_ByAge, pd, np, pp)
testing_df_Hrs_Wrked_ByAge_Provinces = ProvinceAnalysis(testing_df_Hrs_Wrked_ByAge, pd, np, pp)

training_df_Hrs_Wrked_ByGender_Provinces = ProvinceAnalysis(training_df_Hrs_Wrked_ByGender, pd, np, pp)
testing_df_Hrs_Wrked_ByGender_Provinces = ProvinceAnalysis(testing_df_Hrs_Wrked_ByGender, pd, np, pp)

training_df_Hrs_Wrked_ByEducation_Provinces = ProvinceAnalysis(training_df_Hrs_Wrked_ByEducation, pd, np, pp)
testing_df_Hrs_Wrked_ByEducation_Provinces = ProvinceAnalysis(testing_df_Hrs_Wrked_ByEducation, pd, np, pp)

training_df_Hrs_Wrked_ByImmigrant_Provinces = ProvinceAnalysis(training_df_Hrs_Wrked_ByImmigrant, pd, np, pp)
testing_df_Hrs_Wrked_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_Hrs_Wrked_ByImmigrant, pd, np, pp)

# training_df_Hrs_Wrked_ByIndigenous_Provinces = ProvinceAnalysis(training_df_Hrs_Wrked_ByIndigenous, pd, np, pp)
# testing_df_Hrs_Wrked_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_Hrs_Wrked_ByIndigenous, pd, np, pp)


Filter by province for "Number of jobs"

In [157]:
# Split the "Number of jobs" datasets by province.

training_df_NumOfJob_ByAge_Provinces = ProvinceAnalysis(training_df_NumOfJob_ByAge, pd, np, pp)
testing_df_NumOfJob_ByAge_Provinces = ProvinceAnalysis(testing_df_NumOfJob_ByAge, pd, np, pp)

training_df_NumOfJob_ByGender_Provinces = ProvinceAnalysis(training_df_NumOfJob_ByGender, pd, np, pp)
testing_df_NumOfJob_ByGender_Provinces = ProvinceAnalysis(testing_df_NumOfJob_ByGender, pd, np, pp)

training_df_NumOfJob_ByEducation_Provinces = ProvinceAnalysis(training_df_NumOfJob_ByEducation, pd, np, pp)
testing_df_NumOfJob_ByEducation_Provinces = ProvinceAnalysis(testing_df_NumOfJob_ByEducation, pd, np, pp)

training_df_NumOfJob_ByImmigrant_Provinces = ProvinceAnalysis(training_df_NumOfJob_ByImmigrant, pd, np, pp)
testing_df_NumOfJob_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_NumOfJob_ByImmigrant, pd, np, pp)

# training_df_NumOfJob_ByIndigenous_Provinces = ProvinceAnalysis(training_df_NumOfJob_ByIndigenous, pd, np, pp)
# testing_df_NumOfJob_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_NumOfJob_ByIndigenous, pd, np, pp)

Filter by province for "Wages and salaries"

In [158]:
# Split the "Wages and salaries" datasets by province.

training_df_WagesAndSalaries_ByAge_Provinces = ProvinceAnalysis(training_df_WagesAndSalaries_ByAge, pd, np, pp)
testing_df_WagesAndSalaries_ByAge_Provinces = ProvinceAnalysis(testing_df_WagesAndSalaries_ByAge, pd, np, pp)

training_df_WagesAndSalaries_ByGender_Provinces = ProvinceAnalysis(training_df_WagesAndSalaries_ByGender, pd, np, pp)
testing_df_WagesAndSalaries_ByGender_Provinces = ProvinceAnalysis(testing_df_WagesAndSalaries_ByGender, pd, np, pp)

training_df_WagesAndSalaries_ByEducation_Provinces = ProvinceAnalysis(training_df_WagesAndSalaries_ByEducation, pd, np, pp)
testing_df_WagesAndSalaries_ByEducation_Provinces = ProvinceAnalysis(testing_df_WagesAndSalaries_ByEducation, pd, np, pp)

training_df_WagesAndSalaries_ByImmigrant_Provinces = ProvinceAnalysis(training_df_WagesAndSalaries_ByImmigrant, pd, np, pp)
testing_df_WagesAndSalaries_ByImmigrant_Provinces = ProvinceAnalysis(testing_df_WagesAndSalaries_ByImmigrant, pd, np, pp)

# training_df_WagesAndSalaries_ByIndigenous_Provinces = ProvinceAnalysis(training_df_WagesAndSalaries_ByIndigenous, pd, np, pp)
# testing_df_WagesAndSalaries_ByIndigenous_Provinces = ProvinceAnalysis(testing_df_WagesAndSalaries_ByIndigenous, pd, np, pp)

<b>Next part: select five provinces and merge them with the previously divided datasets.</b>

Working with every province would produce too many datasets to handle. <br />
Instead, I select only five provinces and combine them into one dataset. <br />
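
The merge-and-encode step can be sketched on toy data. This is a minimal illustration, assuming two small hypothetical frames in place of the real per-province subsets: `pd.concat` stacks the province frames, and `pd.get_dummies` replaces the `GEO` column with one indicator column per province.

```python
import pandas as pd

# Hypothetical toy frames standing in for two per-province subsets.
alberta = pd.DataFrame({"GEO": ["Alberta", "Alberta"], "VALUE": [1.0, 2.0]})
ontario = pd.DataFrame({"GEO": ["Ontario"], "VALUE": [3.0]})

# Stack the province subsets into one frame; the original row indices are kept.
merged = pd.concat([alberta, ontario])

# One-hot encode GEO: the GEO column is dropped and replaced by
# one indicator column per province (GEO_Alberta, GEO_Ontario).
encoded = pd.get_dummies(merged, columns=["GEO"])
print(encoded.columns.tolist())
```

Because the encoded frame no longer has a `GEO` column, any lookup of province names has to use the un-encoded copy.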

Class methods for handling the five provinces:

In [159]:
# Import label encoder 
from sklearn import preprocessing 

class FiveProvinceAnalysis:

    # Province list (the position in this list is the province_id):
    # ['Alberta', 'British Columbia', 'Canada', 'Manitoba', 'New Brunswick',
    #  'Newfoundland and Labrador', 'Northwest Territories', 'Nova Scotia', 'Nunavut',
    #  'Ontario', 'Prince Edward Island', 'Quebec', 'Saskatchewan', 'Yukon']

    def __init__(self, df, pd, np, pp):

        # Based on https://www.linkedin.com/pulse/5-best-provinces-canada-look-jobs-2023-/
        # Five popular provinces for employment: ['Alberta', 'British Columbia', 'Nova Scotia', 'Ontario', 'Quebec']
        # These are positions 0, 1, 7, 9 and 11 in the province list.

        temp_df = df.outputFiveProvinces(0,1,7,9,11)
        
        self.df_FiveProvinces = temp_df.copy()
        self.one_hot_encoded_data = pd.get_dummies(temp_df, columns = ['GEO'])# , 'Characteristics']) 
        self.pd = pd
        self.np = np

    def convertCategoricalToNumericValue(self, ct):
        # ct = ['age', 'gender', 'education', 'immigrant', 'aboriginal']

        if (ct == 0):
            # https://www.statology.org/pandas-create-duplicate-column/
            # https://saturncloud.io/blog/how-to-replace-values-on-specific-columns-in-pandas/
            # Alternative, https://www.statology.org/data-binning-in-python/
            # Using Binning numerical variables from https://www.datacamp.com/tutorial/categorical-data

            # ['15 to 24 years' '25 to 34 years' '35 to 44 years' '45 to 54 years' '55 to 64 years' '65 years old and over']
            
            self.one_hot_encoded_data['Age_group'] = self.one_hot_encoded_data.loc[:,'Characteristics'] # pd.qcut(df['Characteristics'], q=3)
            # print(self.one_hot_encoded_data['Age_group'].unique())

            age_mapping = {
                '15 to 24 years': 20,
                '25 to 34 years': 30,
                '35 to 44 years': 40,
                '45 to 54 years': 50,
                '55 to 64 years': 60,
                '65 years old and over': 70
            }

            # Define a custom function
            def replace_agestr_with_number(age_group):
                return age_mapping.get(age_group, age_group)

            # Apply the custom function to the 'Age_Group' column
            # self.one_hot_encoded_data = df.copy()
            self.one_hot_encoded_data['Age_group'] = self.one_hot_encoded_data['Age_group'].apply(replace_agestr_with_number)
            # print(self.one_hot_encoded_data['Age_Group'].unique())
            # print(self.one_hot_encoded_data.head(20))
        elif (ct == 2):
            self.one_hot_encoded_data['Education_group'] = self.one_hot_encoded_data.loc[:,'Characteristics']

            education_mapping = {
                'High school diploma and less': 1,
                'Trade certificate': 2,
                'University degree and higher': 3,
            }

            # Define a custom function
            def replace_educationstr_with_number(education_group):
                return education_mapping.get(education_group, education_group)

            # Apply the custom function to the 'Education_group' column
            self.one_hot_encoded_data['Education_group'] = self.one_hot_encoded_data['Education_group'].apply(replace_educationstr_with_number)
        elif (ct == 1) :
            # https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/

            self.one_hot_encoded_data['Gender_group'] = self.one_hot_encoded_data.loc[:,'Characteristics']
            
            # LabelEncoder maps each distinct string label
            # to an integer code.
            label_encoder = preprocessing.LabelEncoder()

            # Encode the labels in column 'Gender_group'.
            self.one_hot_encoded_data['Gender_group'] = label_encoder.fit_transform(self.one_hot_encoded_data['Gender_group'])
        elif (ct == 3):
            # Repeat from ct = 1
            self.one_hot_encoded_data['Immigrant_status'] = self.one_hot_encoded_data.loc[:,'Characteristics']
            label_encoder = preprocessing.LabelEncoder() 
            self.one_hot_encoded_data['Immigrant_status'] = label_encoder.fit_transform(self.one_hot_encoded_data['Immigrant_status'] ) 
        elif (ct == 4):
            # Repeat from ct = 1
            self.one_hot_encoded_data['Aboriginal_status'] = self.one_hot_encoded_data.loc[:,'Characteristics']
            label_encoder = preprocessing.LabelEncoder() 
            self.one_hot_encoded_data['Aboriginal_status'] = label_encoder.fit_transform(self.one_hot_encoded_data['Aboriginal_status'] ) 
        else:
            print("Error! Not eligible to convert.")

    def print_Unique(self, to_id):
        print(self.one_hot_encoded_data[to_id].unique())
    
    def print_pre_Unique(self, to_id):
        print(self.df_FiveProvinces[to_id].unique())

    def print_Province_Unique(self):
        # GEO is one-hot encoded in one_hot_encoded_data, so read it from the un-encoded copy.
        self.print_pre_Unique('GEO')

    def print_Value_Counts(self, to_id):
        print(self.df_FiveProvinces[to_id].value_counts())

    def print_Province_Value_Counts(self):
        self.print_Value_Counts('GEO')

    def print_One_Hot_Encoded_Data(self):
        print(self.one_hot_encoded_data.head(20))

    def print_One_Hot_Encoded_Data_Info(self):
        print(self.one_hot_encoded_data.info())

    def output_One_Hot_Encoded_Data(self):
        return self.one_hot_encoded_data
    
    def output_Original_Data(self):
        return self.df_FiveProvinces
    
    def print_Analysis_Provinces(self):
        print("Five popular provinces for employment:")
        # print_Province_Unique() prints the list itself and returns None, so it is not wrapped in print().
        self.print_Province_Unique()
        print("Source: https://www.linkedin.com/pulse/5-best-provinces-canada-look-jobs-2023-/")

    def print_AnalysisResult_ByCharacteristics(self):
        grouped = self.df_FiveProvinces.groupby(['Characteristics'])
        print(grouped['VALUE'].agg([np.sum, np.mean, np.min, np.median, np.max, np.std, np.size]))
        print("")
        print("Overall,")
        print("Sum : ",np.sum(self.df_FiveProvinces['VALUE']))
        print("Mean : ",np.mean(self.df_FiveProvinces['VALUE']))
        print("Min/median/max :",np.min(self.df_FiveProvinces['VALUE']),"/",
            np.median(self.df_FiveProvinces['VALUE']),"/",
            np.max(self.df_FiveProvinces['VALUE']))
        print("Standard Deviation: ",np.std(self.df_FiveProvinces['VALUE']))
        print("Skewness : ",self.df_FiveProvinces['VALUE'].skew())
        print("Total size : ",len(self.df_FiveProvinces.index))

    def print_AnalysisResult_ByProvinces(self):
        grouped = self.df_FiveProvinces.groupby(['GEO'])
        print(grouped['VALUE'].agg([np.sum, np.mean, np.min, np.median, np.max, np.std, np.size]))
        print("")
        print("Overall,")
        print("Sum : ",np.sum(self.df_FiveProvinces['VALUE']))
        print("Mean : ",np.mean(self.df_FiveProvinces['VALUE']))
        print("Min/median/max :",np.min(self.df_FiveProvinces['VALUE']),"/",
            np.median(self.df_FiveProvinces['VALUE']),"/",
            np.max(self.df_FiveProvinces['VALUE']))
        print("Standard Deviation: ",np.std(self.df_FiveProvinces['VALUE']))
        print("Skewness : ",self.df_FiveProvinces['VALUE'].skew())
        print("Total size : ",len(self.df_FiveProvinces.index))

    def print_histogram(self):
        sns.displot(data=self.df_FiveProvinces, x="VALUE", kind="hist", bins = 100, aspect = 1.5)
        plt.show()
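
The two encodings used in `convertCategoricalToNumericValue` can be compared on toy data. This is a minimal sketch under stated assumptions: the gender labels are hypothetical placeholders, and `Series.map` is swapped in for the `apply` call with a `.get()` fallback used in the class. The age bands get an ordered numeric value from an explicit dictionary, while `LabelEncoder` assigns integer codes in sorted label order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Ordinal mapping: the same age bands and values as in the class above.
age_mapping = {
    "15 to 24 years": 20,
    "25 to 34 years": 30,
    "35 to 44 years": 40,
    "45 to 54 years": 50,
    "55 to 64 years": 60,
    "65 years old and over": 70,
}
ages = pd.Series(["25 to 34 years", "15 to 24 years", "65 years old and over"])

# Series.map returns NaN for labels missing from the dictionary,
# whereas the class's .get(x, x) fallback leaves such labels unchanged.
age_codes = ages.map(age_mapping)

# Label encoding: hypothetical gender labels, coded in sorted order.
genders = pd.Series(["Male employees", "Female employees", "Male employees"])
encoder = LabelEncoder()
gender_codes = encoder.fit_transform(genders)

print(age_codes.tolist())
print(list(encoder.classes_), gender_codes.tolist())
```

The dictionary keeps the natural order of the age bands, which a label encoder would only preserve by coincidence of alphabetical sorting.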

In [160]:
training_df_AvgAnnHrsWrk_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnHrsWrk_ByAge_Provinces, pd, np, pp)
training_df_AvgAnnHrsWrk_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByAge_Provinces, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.print_One_Hot_Encoded_Data()
testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.print_Unique('Age_group')


training_df_AvgAnnHrsWrk_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnHrsWrk_ByGender_Provinces, pd, np, pp)
training_df_AvgAnnHrsWrk_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByGender_Provinces, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_AvgAnnHrsWrk_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnHrsWrk_ByEducation_Provinces, pd, np, pp)
training_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByEducation_Provinces, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.print_Unique('Education_group')

training_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnHrsWrk_ByImmigrant_Provinces, pd, np, pp)
training_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnHrsWrk_ByImmigrant_Provinces, pd, np, pp)
testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_AvgAnnHrsWrk_ByIndigenous_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnHrsWrk_ByIndigenous_Provinces, pd, np, pp)
# training_df_AvgAnnHrsWrk_ByIndigenous_FiveProvinces.convertCategoricalToNumericValue(4)
# testing_df_AvgAnnHrsWrk_ByIndigenous_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnHrsWrk_ByIndigenous_Provinces, pd, np, pp)
# testing_df_AvgAnnHrsWrk_ByIndigenous_FiveProvinces.convertCategoricalToNumericValue(4)
# testing_df_AvgAnnHrsWrk_ByIndigenous_FiveProvinces.print_One_Hot_Encoded_Data_Info()
# testing_df_AvgAnnHrsWrk_ByIndigenous_FiveProvinces.print_Unique('Aboriginal_status')

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85137 to 100796
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


In [165]:
training_df_AvgAnnWages_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnWages_ByAge_Provinces, pd, np, pp)
training_df_AvgAnnWages_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgAnnWages_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnWages_ByAge_Provinces, pd, np, pp)
testing_df_AvgAnnWages_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgAnnWages_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
# testing_df_AvgAnnWages_ByAge_FiveProvinces.print_Unique('Age_Group')

training_df_AvgAnnWages_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnWages_ByGender_Provinces, pd, np, pp)
training_df_AvgAnnWages_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgAnnWages_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnWages_ByGender_Provinces, pd, np, pp)
testing_df_AvgAnnWages_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgAnnWages_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnWages_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_AvgAnnWages_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnWages_ByEducation_Provinces, pd, np, pp)
training_df_AvgAnnWages_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgAnnWages_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnWages_ByEducation_Provinces, pd, np, pp)
testing_df_AvgAnnWages_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgAnnWages_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnWages_ByEducation_FiveProvinces.print_Unique('Education_group')

training_df_AvgAnnWages_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnWages_ByImmigrant_Provinces, pd, np, pp)
training_df_AvgAnnWages_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgAnnWages_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgAnnWages_ByImmigrant_Provinces, pd, np, pp)
testing_df_AvgAnnWages_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgAnnWages_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgAnnWages_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_AvgAnnWages_ByIndigenous_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnWages_ByIndigenous_Provinces, pd, np, pp)
# training_df_AvgAnnWages_ByIndigenous_FiveProvinces.convertCategoricalToNumericValue(4)
# testing_df_AvgAnnWages_ByIndigenous_FiveProvinces = FiveProvinceAnalysis(training_df_AvgAnnWages_ByIndigenous_Provinces, pd, np, pp)
# testing_df_AvgAnnWages_ByIndigenous_FiveProvinces.convertCategoricalToNumericValue(4)
# testing_df_AvgAnnWages_ByIndigenous_FiveProvinces.print_One_Hot_Encoded_Data_Info()
# testing_df_AvgAnnWages_ByIndigenous_FiveProvinces.print_Unique('Immigrant_status')

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85139 to 100798
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


In [166]:
training_df_AvgHrsWages_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_AvgHrsWages_ByAge_Provinces, pd, np, pp)
training_df_AvgHrsWages_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgHrsWages_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgHrsWages_ByAge_Provinces, pd, np, pp)
testing_df_AvgHrsWages_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgHrsWages_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgHrsWages_ByAge_FiveProvinces.print_Unique('Age_group')

training_df_AvgHrsWages_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_AvgHrsWages_ByGender_Provinces, pd, np, pp)
training_df_AvgHrsWages_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgHrsWages_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgHrsWages_ByGender_Provinces, pd, np, pp)
testing_df_AvgHrsWages_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgHrsWages_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgHrsWages_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_AvgHrsWages_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_AvgHrsWages_ByEducation_Provinces, pd, np, pp)
training_df_AvgHrsWages_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgHrsWages_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgHrsWages_ByEducation_Provinces, pd, np, pp)
testing_df_AvgHrsWages_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgHrsWages_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgHrsWages_ByEducation_FiveProvinces.print_Unique('Education_group')

training_df_AvgHrsWages_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_AvgHrsWages_ByImmigrant_Provinces, pd, np, pp)
training_df_AvgHrsWages_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgHrsWages_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgHrsWages_ByImmigrant_Provinces, pd, np, pp)
testing_df_AvgHrsWages_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgHrsWages_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgHrsWages_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_AvgHrsWages_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_AvgHrsWages_ByIndigenous_Provinces, pd, np, pp)
# training_df_AvgHrsWages_ByImmigrant_FiveProvince.print_One_Hot_Encoded_Data_Info()
# testing_df_AvgHrsWages_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_AvgHrsWages_ByIndigenous_Provinces, pd, np, pp)
# testing_df_AvgHrsWages_ByImmigrant_FiveProvince.print_One_Hot_Encoded_Data_Info()

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85140 to 100799
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


In [168]:
training_df_AvgWeekHrsWrked_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_AvgWeekHrsWrked_ByAge_Provinces, pd, np, pp)
training_df_AvgWeekHrsWrked_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByAge_Provinces, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
# testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces.print_Unique('Age_Group')

training_df_AvgWeekHrsWrked_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_AvgWeekHrsWrked_ByGender_Provinces, pd, np, pp)
training_df_AvgWeekHrsWrked_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByGender_Provinces, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_AvgWeekHrsWrked_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_AvgWeekHrsWrked_ByEducation_Provinces, pd, np, pp)
training_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByEducation_Provinces, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.print_Unique('Education_group')
testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()

training_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_AvgWeekHrsWrked_ByImmigrant_Provinces, pd, np, pp)
training_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_AvgWeekHrsWrked_ByImmigrant_Provinces, pd, np, pp)
testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_AvgWeekHrsWrked_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_AvgWeekHrsWrkeds_ByIndigenous_Provinces, pd, np, pp)
# training_df_AvgWeekHrsWrked_ByImmigrant_FiveProvince.print_One_Hot_Encoded_Data_Info()
# testing_df_AvgWeekHrsWrked_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_AvgWeekHrsWrkeds_ByIndigenous_Provinces, pd, np, pp)

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85138 to 100797
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


In [169]:
training_df_Hrs_Wrked_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_Hrs_Wrked_ByAge_Provinces, pd, np, pp)
training_df_Hrs_Wrked_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_Hrs_Wrked_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_Hrs_Wrked_ByAge_Provinces, pd, np, pp)
testing_df_Hrs_Wrked_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_Hrs_Wrked_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_Hrs_Wrked_ByAge_FiveProvinces.print_Unique('Age_group')

training_df_Hrs_Wrked_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_Hrs_Wrked_ByGender_Provinces, pd, np, pp)
training_df_Hrs_Wrked_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_Hrs_Wrked_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_Hrs_Wrked_ByGender_Provinces, pd, np, pp)
testing_df_Hrs_Wrked_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_Hrs_Wrked_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_Hrs_Wrked_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_Hrs_Wrked_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_Hrs_Wrked_ByEducation_Provinces, pd, np, pp)
training_df_Hrs_Wrked_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_Hrs_Wrked_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_Hrs_Wrked_ByEducation_Provinces, pd, np, pp)
testing_df_Hrs_Wrked_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_Hrs_Wrked_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_Hrs_Wrked_ByEducation_FiveProvinces.print_Unique('Education_group')

training_df_Hrs_Wrked_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_Hrs_Wrked_ByImmigrant_Provinces, pd, np, pp)
training_df_Hrs_Wrked_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_Hrs_Wrked_ByImmigrant_Provinces, pd, np, pp)
testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_Hrs_Wrkeds_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_Hrs_Wrkeds_ByIndigenous_Provinces, pd, np, pp)
# training_df_Hrs_Wrkeds_ByImmigrant_FiveProvince.print_One_Hot_Encoded_Data_Info()
# testing_df_Hrs_Wrkeds_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_Hrs_Wrkeds_ByIndigenous_Provinces, pd, np, pp)
# testing_df_Hrs_Wrkeds_ByImmigrant_FiveProvince.print_One_Hot_Encoded_Data_Info()

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85135 to 100794
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


In [170]:
training_df_NumOfJob_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_NumOfJob_ByAge_Provinces, pd, np, pp)
training_df_NumOfJob_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_NumOfJob_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_NumOfJob_ByAge_Provinces, pd, np, pp)
testing_df_NumOfJob_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_NumOfJob_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_NumOfJob_ByAge_FiveProvinces.print_Unique('Age_group')

training_df_NumOfJob_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_NumOfJob_ByGender_Provinces, pd, np, pp)
training_df_NumOfJob_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_NumOfJob_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_NumOfJob_ByGender_Provinces, pd, np, pp)
testing_df_NumOfJob_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_NumOfJob_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_NumOfJob_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_NumOfJob_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_NumOfJob_ByEducation_Provinces, pd, np, pp)
training_df_NumOfJob_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_NumOfJob_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_NumOfJob_ByEducation_Provinces, pd, np, pp)
testing_df_NumOfJob_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_NumOfJob_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_NumOfJob_ByEducation_FiveProvinces.print_Unique('Education_group')

training_df_NumOfJob_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_NumOfJob_ByImmigrant_Provinces, pd, np, pp)
training_df_NumOfJob_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_NumOfJob_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_NumOfJob_ByImmigrant_Provinces, pd, np, pp)
testing_df_NumOfJob_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_NumOfJob_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_NumOfJob_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_NumOfJobs_ByIndigenous_FiveProvinces = FiveProvinceAnalysis(training_df_NumOfJobs_ByIndigenous_Provinces, pd, np, pp)
# training_df_NumOfJobs_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
# testing_df_NumOfJobs_ByIndigenous_FiveProvinces = FiveProvinceAnalysis(training_df_NumOfJobs_ByIndigenous_Provinces, pd, np, pp)
# testing_df_NumOfJobs_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85134 to 100793
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


In [171]:
training_df_WagesAndSalaries_ByAge_FiveProvinces = FiveProvinceAnalysis(training_df_WagesAndSalaries_ByAge_Provinces, pd, np, pp)
training_df_WagesAndSalaries_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_WagesAndSalaries_ByAge_FiveProvinces = FiveProvinceAnalysis(testing_df_WagesAndSalaries_ByAge_Provinces, pd, np, pp)
testing_df_WagesAndSalaries_ByAge_FiveProvinces.convertCategoricalToNumericValue(0)
testing_df_WagesAndSalaries_ByAge_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_WagesAndSalaries_ByAge_FiveProvinces.print_Unique('Age_group')

training_df_WagesAndSalaries_ByGender_FiveProvinces = FiveProvinceAnalysis(training_df_WagesAndSalaries_ByGender_Provinces, pd, np, pp)
training_df_WagesAndSalaries_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_WagesAndSalaries_ByGender_FiveProvinces = FiveProvinceAnalysis(testing_df_WagesAndSalaries_ByGender_Provinces, pd, np, pp)
testing_df_WagesAndSalaries_ByGender_FiveProvinces.convertCategoricalToNumericValue(1)
testing_df_WagesAndSalaries_ByGender_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_WagesAndSalaries_ByGender_FiveProvinces.print_Unique('Gender_group')

training_df_WagesAndSalaries_ByEducation_FiveProvinces = FiveProvinceAnalysis(training_df_WagesAndSalaries_ByEducation_Provinces, pd, np, pp)
training_df_WagesAndSalaries_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_WagesAndSalaries_ByEducation_FiveProvinces = FiveProvinceAnalysis(testing_df_WagesAndSalaries_ByEducation_Provinces, pd, np, pp)
testing_df_WagesAndSalaries_ByEducation_FiveProvinces.convertCategoricalToNumericValue(2)
testing_df_WagesAndSalaries_ByEducation_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_WagesAndSalaries_ByEducation_FiveProvinces.print_Unique('Education_group')

training_df_WagesAndSalaries_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(training_df_WagesAndSalaries_ByImmigrant_Provinces, pd, np, pp)
training_df_WagesAndSalaries_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces = FiveProvinceAnalysis(testing_df_WagesAndSalaries_ByImmigrant_Provinces, pd, np, pp)
testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces.convertCategoricalToNumericValue(3)
testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces.print_One_Hot_Encoded_Data_Info()
testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces.print_Unique('Immigrant_status')

# training_df_WagesAndSalaries_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_WagesAndSalaries_ByIndigenous_Provinces, pd, np, pp)
# training_df_WagesAndSalaries_ByImmigrant_FiveProvince.print_One_Hot_Encoded_Data_Info()
# testing_df_WagesAndSalaries_ByIndigenous_FiveProvince = FiveProvinceAnalysis(training_df_WagesAndSalaries_ByIndigenous_Provinces, pd, np, pp)

<class 'pandas.core.frame.DataFrame'>
Index: 450 entries, 85136 to 100795
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   REF_DATE              450 non-null    int64  
 1   DGUID                 450 non-null    object 
 2   Sector                450 non-null    object 
 3   Characteristics       450 non-null    object 
 4   Indicators            450 non-null    object 
 5   UOM                   450 non-null    object 
 6   SCALAR_FACTOR         450 non-null    object 
 7   VALUE                 450 non-null    float64
 8   GEO_Alberta           450 non-null    bool   
 9   GEO_British Columbia  450 non-null    bool   
 10  GEO_Nova Scotia       450 non-null    bool   
 11  GEO_Ontario           450 non-null    bool   
 12  GEO_Quebec            450 non-null    bool   
 13  Age_group             450 non-null    int64  
dtypes: bool(5), float64(1), int64(2), object(6)
memory usage: 37.4+ KB
None


<h3> Final part - will be performed in other notebook files </h3>

Save the intermediate datasets to CSV files so that these results can be reproduced later if needed.

In [172]:
# Save the dataframe to a CSV file

training_df_AvgAnnHrsWrk_ByAge.to_csv('Result_By_Provinces/training_df_AvgAnnHrsWrk_ByAge.csv', index=False) # Average annual hours worked
testing_df_AvgAnnHrsWrk_ByAge.to_csv('Result_By_Provinces/testing_df_AvgAnnHrsWrk_ByAge.csv', index=False)

training_df_AvgAnnHrsWrk_ByGender.to_csv('Result_By_Provinces/training_df_AvgAnnHrsWrk_ByGender.csv', index=False) # Average annual hours worked
testing_df_AvgAnnHrsWrk_ByGender.to_csv('Result_By_Provinces/testing_df_AvgAnnHrsWrk_ByGender.csv', index=False)

training_df_AvgAnnHrsWrk_ByEducation.to_csv('Result_By_Provinces/training_df_AvgAnnHrsWrk_ByEducation.csv', index=False) # Average annual hours worked
testing_df_AvgAnnHrsWrk_ByEducation.to_csv('Result_By_Provinces/testing_df_AvgAnnHrsWrk_ByEducation.csv', index=False)

training_df_AvgAnnHrsWrk_ByImmigrant.to_csv('Result_By_Provinces/training_df_AvgAnnHrsWrk_ByImmigrant.csv', index=False) # Average annual hours worked
testing_df_AvgAnnHrsWrk_ByImmigrant.to_csv('Result_By_Provinces/testing_df_AvgAnnHrsWrk_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.

In [173]:
training_df_AvgAnnWages_ByAge.to_csv('Result_By_Provinces/training_df_AvgAnnWages_ByAge.csv', index=False) # Average annual wages
testing_df_AvgAnnWages_ByAge.to_csv('Result_By_Provinces/testing_df_AvgAnnWages_ByAge.csv', index=False)

training_df_AvgAnnWages_ByGender.to_csv('Result_By_Provinces/training_df_AvgAnnWages_ByGender.csv', index=False) # Average annual wages
testing_df_AvgAnnWages_ByGender.to_csv('Result_By_Provinces/testing_df_AvgAnnWages_ByGender.csv', index=False)

training_df_AvgAnnWages_ByEducation.to_csv('Result_By_Provinces/training_df_AvgAnnWages_ByEducation.csv', index=False) # Average annual wages
testing_df_AvgAnnWages_ByEducation.to_csv('Result_By_Provinces/testing_df_AvgAnnWages_ByEducation.csv', index=False)

training_df_AvgAnnWages_ByImmigrant.to_csv('Result_By_Provinces/training_df_AvgAnnWages_ByImmigrant.csv', index=False) # Average annual wages
testing_df_AvgAnnWages_ByImmigrant.to_csv('Result_By_Provinces/testing_df_AvgAnnWages_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.

In [174]:
training_df_AvgHrsWages_ByAge.to_csv('Result_By_Provinces/training_df_AvgHrsWages_ByAge.csv', index=False) # Average hourly wages
testing_df_AvgHrsWages_ByAge.to_csv('Result_By_Provinces/testing_df_AvgHrsWages_ByAge.csv', index=False)

training_df_AvgHrsWages_ByGender.to_csv('Result_By_Provinces/training_df_AvgHrsWages_ByGender.csv', index=False) # Average hourly wages
testing_df_AvgHrsWages_ByGender.to_csv('Result_By_Provinces/testing_df_AvgHrsWages_ByGender.csv', index=False)

training_df_AvgHrsWages_ByEducation.to_csv('Result_By_Provinces/training_df_AvgHrsWages_ByEducation.csv', index=False) # Average hourly wages
testing_df_AvgHrsWages_ByEducation.to_csv('Result_By_Provinces/testing_df_AvgHrsWages_ByEducation.csv', index=False)

training_df_AvgHrsWages_ByImmigrant.to_csv('Result_By_Provinces/training_df_AvgHrsWages_ByImmigrant.csv', index=False) # Average hourly wages
testing_df_AvgHrsWages_ByImmigrant.to_csv('Result_By_Provinces/testing_df_AvgHrsWages_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.

In [175]:
training_df_AvgWeekHrsWrked_ByAge.to_csv('Result_By_Provinces/training_df_AvgWeekHrsWrked_ByAge.csv', index=False) # Average weekly hours worked
testing_df_AvgWeekHrsWrked_ByAge.to_csv('Result_By_Provinces/testing_df_AvgWeekHrsWrked_ByAge.csv', index=False)

training_df_AvgWeekHrsWrked_ByGender.to_csv('Result_By_Provinces/training_df_AvgWeekHrsWrked_ByGender.csv', index=False) # Average weekly hours worked
testing_df_AvgWeekHrsWrked_ByGender.to_csv('Result_By_Provinces/testing_df_AvgWeekHrsWrked_ByGender.csv', index=False)

training_df_AvgWeekHrsWrked_ByEducation.to_csv('Result_By_Provinces/training_df_AvgWeekHrsWrked_ByEducation.csv', index=False) # Average weekly hours worked
testing_df_AvgWeekHrsWrked_ByEducation.to_csv('Result_By_Provinces/testing_df_AvgWeekHrsWrked_ByEducation.csv', index=False)

training_df_AvgWeekHrsWrked_ByImmigrant.to_csv('Result_By_Provinces/training_df_AvgWeekHrsWrked_ByImmigrant.csv', index=False) # Average weekly hours worked
testing_df_AvgWeekHrsWrked_ByImmigrant.to_csv('Result_By_Provinces/testing_df_AvgWeekHrsWrked_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.

In [176]:
training_df_Hrs_Wrked_ByAge.to_csv('Result_By_Provinces/training_df_Hrs_Wrked_ByAge.csv', index=False) # Hours worked
testing_df_Hrs_Wrked_ByAge.to_csv('Result_By_Provinces/testing_df_Hrs_Wrked_ByAge.csv', index=False)

training_df_Hrs_Wrked_ByGender.to_csv('Result_By_Provinces/training_df_Hrs_Wrked_ByGender.csv', index=False) # Hours worked
testing_df_Hrs_Wrked_ByGender.to_csv('Result_By_Provinces/testing_df_Hrs_Wrked_ByGender.csv', index=False)

training_df_Hrs_Wrked_ByEducation.to_csv('Result_By_Provinces/training_df_Hrs_Wrked_ByEducation.csv', index=False) # Hours worked
testing_df_Hrs_Wrked_ByEducation.to_csv('Result_By_Provinces/testing_df_Hrs_Wrked_ByEducation.csv', index=False)

training_df_Hrs_Wrked_ByImmigrant.to_csv('Result_By_Provinces/training_df_Hrs_Wrked_ByImmigrant.csv', index=False) # Hours worked
testing_df_Hrs_Wrked_ByImmigrant.to_csv('Result_By_Provinces/testing_df_Hrs_Wrked_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.

In [177]:
training_df_NumOfJob_ByAge.to_csv('Result_By_Provinces/training_df_NumOfJob_ByAge.csv', index=False) # Number of jobs
testing_df_NumOfJob_ByAge.to_csv('Result_By_Provinces/testing_df_NumOfJob_ByAge.csv', index=False)

training_df_NumOfJob_ByGender.to_csv('Result_By_Provinces/training_df_NumOfJob_ByGender.csv', index=False) # Number of jobs
testing_df_NumOfJob_ByGender.to_csv('Result_By_Provinces/testing_df_NumOfJob_ByGender.csv', index=False)

training_df_NumOfJob_ByEducation.to_csv('Result_By_Provinces/training_df_NumOfJob_ByEducation.csv', index=False) # Number of jobs
testing_df_NumOfJob_ByEducation.to_csv('Result_By_Provinces/testing_df_NumOfJob_ByEducation.csv', index=False)

training_df_NumOfJob_ByImmigrant.to_csv('Result_By_Provinces/training_df_NumOfJob_ByImmigrant.csv', index=False) # Number of jobs
testing_df_NumOfJob_ByImmigrant.to_csv('Result_By_Provinces/testing_df_NumOfJob_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.

In [178]:
training_df_WagesAndSalaries_ByAge.to_csv('Result_By_Provinces/training_df_WagesAndSalaries_ByAge.csv', index=False) # Wages and salaries
testing_df_WagesAndSalaries_ByAge.to_csv('Result_By_Provinces/testing_df_WagesAndSalaries_ByAge.csv', index=False)

training_df_WagesAndSalaries_ByGender.to_csv('Result_By_Provinces/training_df_WagesAndSalaries_ByGender.csv', index=False) # Wages and salaries
testing_df_WagesAndSalaries_ByGender.to_csv('Result_By_Provinces/testing_df_WagesAndSalaries_ByGender.csv', index=False)

training_df_WagesAndSalaries_ByEducation.to_csv('Result_By_Provinces/training_df_WagesAndSalaries_ByEducation.csv', index=False) # Wages and salaries
testing_df_WagesAndSalaries_ByEducation.to_csv('Result_By_Provinces/testing_df_WagesAndSalaries_ByEducation.csv', index=False)

training_df_WagesAndSalaries_ByImmigrant.to_csv('Result_By_Provinces/training_df_WagesAndSalaries_ByImmigrant.csv', index=False) # Wages and salaries
testing_df_WagesAndSalaries_ByImmigrant.to_csv('Result_By_Provinces/testing_df_WagesAndSalaries_ByImmigrant.csv', index=False)
# End of Loop and will start next one or end here.
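
The export cells above repeat the same `to_csv` call for every indicator and characteristic. As a possible alternative, the frames could be collected in a dictionary keyed by file name and written in one loop. The sketch below assumes this layout; `save_frames` is a hypothetical helper, and the two small frames stand in for the notebook's real `training_df_*` and `testing_df_*` objects.

```python
import os
import tempfile

import pandas as pd


def save_frames(frames, out_dir):
    """Write each DataFrame in `frames` to <out_dir>/<name>.csv and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = os.path.join(out_dir, f"{name}.csv")
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths


# Stand-in frames with invented values; replace with the real DataFrames.
frames = {
    "training_df_NumOfJob_ByAge": pd.DataFrame({"REF_DATE": [2013], "VALUE": [1.0]}),
    "testing_df_NumOfJob_ByAge": pd.DataFrame({"REF_DATE": [2019], "VALUE": [2.0]}),
}

# A temporary directory keeps the sketch free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    written = save_frames(frames, os.path.join(tmp, "Result_By_Provinces"))
    print([os.path.basename(p) for p in written])
```

Keeping the file name as the dictionary key means the name is typed once per frame, which avoids the copy-and-paste mismatches that long lists of near-identical lines tend to attract.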

Create the directory that will hold the final results.

In [179]:
CreatedTheFile = toOrganizedOutputFiles('Final_Result')

Directory 'Final_Result' is ALREADY created
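
`toOrganizedOutputFiles` is defined earlier in the notebook and its body is not shown here. For readers without that definition, a helper with similar observable behavior might look like the sketch below. The name `ensure_output_dir`, its return value, and the message printed on first creation are assumptions; only the "ALREADY created" message is taken from the output above.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create `path` if it is missing; return True when it was newly created."""
    if os.path.isdir(path):
        print(f"Directory '{path}' is ALREADY created")
        return False
    os.makedirs(path)
    # Assumed wording for the creation case.
    print(f"Directory '{path}' is created")
    return True


# Demonstrate both branches inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Final_Result")
    first = ensure_output_dir(target)   # creates the directory
    second = ensure_output_dir(target)  # finds it already present
```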


In [180]:
# Results of the final analysis.
# Commented out because it is not needed here. Further analysis is contained in Final_Result/'Portion_Technical_Report_Final_Select.ipynb'.
# That code must also be run from inside the 'Final_Result' directory.

# testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.print_AnalysisResult_ByCharacteristics()

# testing_df_AvgAnnWages_ByAge_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgAnnWages_ByGender_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgAnnWages_ByEducation_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgAnnWages_ByImmigrant_FiveProvinces.print_AnalysisResult_ByCharacteristics()

# testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.print_AnalysisResult_ByCharacteristics()

# testing_df_Hrs_Wrked_ByAge_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_Hrs_Wrked_ByGender_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_Hrs_Wrked_ByEducation_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces.print_AnalysisResult_ByCharacteristics()

# testing_df_NumOfJob_ByAge_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_NumOfJob_ByGender_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_NumOfJob_ByEducation_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_NumOfJob_ByImmigrant_FiveProvinces.print_AnalysisResult_ByCharacteristics()


# testing_df_WagesAndSalaries_ByAge_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_WagesAndSalaries_ByGender_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_WagesAndSalaries_ByEducation_FiveProvinces.print_AnalysisResult_ByCharacteristics()
# testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces.print_AnalysisResult_ByCharacteristics()



In [181]:
print("Further analysis is contained in Final_Result/'Portion_Technical_Report_Final_Select.ipynb'")
print("That code must also be run from inside the 'Final_Result' directory")

Further analysis is contained in Final_Result/'Portion_Technical_Report_Final_Select.ipynb'
That code must also be run from inside the 'Final_Result' directory


Save the final result datasets to CSV files, starting with the training datasets.

In [182]:
# Back up the modified training datasets (categories converted to numeric codes) for the five provinces

file_training_df_output_df_AvgAnnHrsWrk_ByAge = training_df_AvgAnnHrsWrk_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgAnnHrsWrk_ByEducation = training_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgAnnHrsWrk_ByGender = training_df_AvgAnnHrsWrk_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgAnnHrsWrk_ByImmigrant = training_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_training_df_output_df_AvgAnnWages_ByAge = training_df_AvgAnnWages_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgAnnWages_ByEducation = training_df_AvgAnnWages_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgAnnWages_ByGender = training_df_AvgAnnWages_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgAnnWages_ByImmigrant = training_df_AvgAnnWages_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_training_df_output_df_AvgHrsWages_ByAge = training_df_AvgHrsWages_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgHrsWages_ByGender = training_df_AvgHrsWages_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgHrsWages_ByEducation = training_df_AvgHrsWages_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgHrsWages_ByImmigrant = training_df_AvgHrsWages_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_training_df_output_df_AvgWeekHrsWrked_ByAge = training_df_AvgWeekHrsWrked_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgWeekHrsWrked_ByEducation = training_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgWeekHrsWrked_ByGender = training_df_AvgWeekHrsWrked_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_AvgWeekHrsWrked_ByImmigrant = training_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_training_df_output_df_Hrs_Wrked_ByAge = training_df_Hrs_Wrked_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_Hrs_Wrked_ByEducation = training_df_Hrs_Wrked_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_Hrs_Wrked_ByGender = training_df_Hrs_Wrked_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_Hrs_Wrked_ByImmigrant = training_df_Hrs_Wrked_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_training_df_output_df_NumOfJob_ByAge = training_df_NumOfJob_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_NumOfJob_ByEducation = training_df_NumOfJob_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_NumOfJob_ByGender = training_df_NumOfJob_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_NumOfJob_ByImmigrant = training_df_NumOfJob_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_training_df_output_df_WagesAndSalaries_ByAge = training_df_WagesAndSalaries_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_WagesAndSalaries_ByEducation = training_df_WagesAndSalaries_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_WagesAndSalaries_ByGender = training_df_WagesAndSalaries_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_training_df_output_df_WagesAndSalaries_ByImmigrant = training_df_WagesAndSalaries_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

In [183]:
# Back up the original training datasets for the five provinces.
# These are not saved to CSV files, so the code is left commented out.

# file_training_df_output_df_AvgAnnHrsWrk_ByAge_Original = training_df_AvgAnnHrsWrk_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgAnnHrsWrk_ByEducation_Original = training_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgAnnHrsWrk_ByGender_Original = training_df_AvgAnnHrsWrk_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgAnnHrsWrk_ByImmigrant_Original = training_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.output_Original_Data()

# file_training_df_output_df_AvgAnnWages_ByAge_Original = training_df_AvgAnnWages_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgAnnWages_ByEducation_Original = training_df_AvgAnnWages_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgAnnWages_ByGender_Original = training_df_AvgAnnWages_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgAnnWages_ByImmigrant_Original = training_df_AvgAnnWages_ByImmigrant_FiveProvinces.output_Original_Data()

# file_training_df_output_df_AvgHrsWages_ByAge_Original = training_df_AvgHrsWages_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgHrsWages_ByGender_Original = training_df_AvgHrsWages_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgHrsWages_ByEducation_Original = training_df_AvgHrsWages_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgHrsWages_ByImmigrant_Original = training_df_AvgHrsWages_ByImmigrant_FiveProvinces.output_Original_Data()

# file_training_df_output_df_AvgWeekHrsWrked_ByAge_Original = training_df_AvgWeekHrsWrked_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgWeekHrsWrked_ByEducation_Original = training_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgWeekHrsWrked_ByGender_Original = training_df_AvgWeekHrsWrked_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_AvgWeekHrsWrked_ByImmigrant_Original = training_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.output_Original_Data()

# file_training_df_output_df_Hrs_Wrked_ByAge_Original = training_df_Hrs_Wrked_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_Hrs_Wrked_ByEducation_Original = training_df_Hrs_Wrked_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_Hrs_Wrked_ByGender_Original = training_df_Hrs_Wrked_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_Hrs_Wrked_ByImmigrant_Original = training_df_Hrs_Wrked_ByImmigrant_FiveProvinces.output_Original_Data()

# file_training_df_output_df_NumOfJob_ByAge_Original = training_df_NumOfJob_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_NumOfJob_ByEducation_Original = training_df_NumOfJob_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_NumOfJob_ByGender_Original = training_df_NumOfJob_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_NumOfJob_ByImmigrant_Original = training_df_NumOfJob_ByImmigrant_FiveProvinces.output_Original_Data()

# file_training_df_output_df_WagesAndSalaries_ByAge_Original = training_df_WagesAndSalaries_ByAge_FiveProvinces.output_Original_Data()
# file_training_df_output_df_WagesAndSalaries_ByEducation_Original = training_df_WagesAndSalaries_ByEducation_FiveProvinces.output_Original_Data()
# file_training_df_output_df_WagesAndSalaries_ByGender_Original = training_df_WagesAndSalaries_ByGender_FiveProvinces.output_Original_Data()
# file_training_df_output_df_WagesAndSalaries_ByImmigrant_Original = training_df_WagesAndSalaries_ByImmigrant_FiveProvinces.output_Original_Data()

In [184]:
# Save the dataframe to a CSV file

file_training_df_output_df_AvgAnnHrsWrk_ByAge.to_csv('Final_Result/final_training_df_output_df_AvgAnnHrsWrk_ByAge.csv', index=False)
file_training_df_output_df_AvgAnnHrsWrk_ByEducation.to_csv('Final_Result/final_training_df_output_df_AvgAnnHrsWrk_ByEducation.csv', index=False)
file_training_df_output_df_AvgAnnHrsWrk_ByGender.to_csv('Final_Result/final_training_df_output_df_AvgAnnHrsWrk_ByGender.csv', index=False)
file_training_df_output_df_AvgAnnHrsWrk_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_AvgAnnHrsWrk_ByImmigrant.csv', index=False)

file_training_df_output_df_AvgAnnWages_ByAge.to_csv('Final_Result/final_training_df_output_df_AvgAnnWages_ByAge.csv', index=False)
file_training_df_output_df_AvgAnnWages_ByEducation.to_csv('Final_Result/final_training_df_output_df_AvgAnnWages_ByEducation.csv', index=False)
file_training_df_output_df_AvgAnnWages_ByGender.to_csv('Final_Result/final_training_df_output_df_AvgAnnWages_ByGender.csv', index=False)
file_training_df_output_df_AvgAnnWages_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_AvgAnnWages_ByImmigrant.csv', index=False)

file_training_df_output_df_AvgHrsWages_ByAge.to_csv('Final_Result/final_training_df_output_df_AvgHrsWages_ByAge.csv', index=False)
file_training_df_output_df_AvgHrsWages_ByGender.to_csv('Final_Result/final_training_df_output_df_AvgHrsWages_ByGender.csv', index=False)
file_training_df_output_df_AvgHrsWages_ByEducation.to_csv('Final_Result/final_training_df_output_df_AvgHrsWages_ByEducation.csv', index=False)
file_training_df_output_df_AvgHrsWages_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_AvgHrsWages_ByImmigrant.csv', index=False)

file_training_df_output_df_AvgWeekHrsWrked_ByAge.to_csv('Final_Result/final_training_df_output_df_AvgWeekHrsWrked_ByAge.csv', index=False)
file_training_df_output_df_AvgWeekHrsWrked_ByEducation.to_csv('Final_Result/final_training_df_output_df_AvgWeekHrsWrked_ByEducation.csv', index=False)
file_training_df_output_df_AvgWeekHrsWrked_ByGender.to_csv('Final_Result/final_training_df_output_df_AvgWeekHrsWrked_ByGender.csv', index=False)
file_training_df_output_df_AvgWeekHrsWrked_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_AvgWeekHrsWrked_ByImmigrant.csv', index=False)

file_training_df_output_df_Hrs_Wrked_ByAge.to_csv('Final_Result/final_training_df_output_df_Hrs_Wrked_ByAge.csv', index=False)
file_training_df_output_df_Hrs_Wrked_ByEducation.to_csv('Final_Result/final_training_df_output_df_Hrs_Wrked_ByEducation.csv', index=False)
file_training_df_output_df_Hrs_Wrked_ByGender.to_csv('Final_Result/final_training_df_output_df_Hrs_Wrked_ByGender.csv', index=False)
file_training_df_output_df_Hrs_Wrked_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_Hrs_Wrked_ByImmigrant.csv', index=False)

file_training_df_output_df_NumOfJob_ByAge.to_csv('Final_Result/final_training_df_output_df_NumOfJob_ByAge.csv', index=False)
file_training_df_output_df_NumOfJob_ByEducation.to_csv('Final_Result/final_training_df_output_df_NumOfJob_ByEducation.csv', index=False)
file_training_df_output_df_NumOfJob_ByGender.to_csv('Final_Result/final_training_df_output_df_NumOfJob_ByGender.csv', index=False)
file_training_df_output_df_NumOfJob_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_NumOfJob_ByImmigrant.csv', index=False)

file_training_df_output_df_WagesAndSalaries_ByAge.to_csv('Final_Result/final_training_df_output_df_WagesAndSalaries_ByAge.csv', index=False)
file_training_df_output_df_WagesAndSalaries_ByEducation.to_csv('Final_Result/final_training_df_output_df_WagesAndSalaries_ByEducation.csv', index=False)
file_training_df_output_df_WagesAndSalaries_ByGender.to_csv('Final_Result/final_training_df_output_df_WagesAndSalaries_ByGender.csv', index=False)
file_training_df_output_df_WagesAndSalaries_ByImmigrant.to_csv('Final_Result/final_training_df_output_df_WagesAndSalaries_ByImmigrant.csv', index=False)

The training results are now saved as CSV files in the Final_Result folder. The testing dataset is handled next in the same way.
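The repeated `to_csv` calls above could also be written as a loop. The following is a minimal sketch, assuming the frames are collected in a dictionary keyed by name. The helper `save_frames` and the toy data are hypothetical and not part of this notebook. It creates the output folder first, because `to_csv` does not create missing directories.

```python
import os
import tempfile

import pandas as pd


def save_frames(frames, out_dir, prefix):
    """Write each DataFrame in `frames` to `<out_dir>/<prefix>_<name>.csv`."""
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing folders
    paths = []
    for name, frame in frames.items():
        path = os.path.join(out_dir, f"{prefix}_{name}.csv")
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths


# Toy stand-ins for two of the notebook's frames (the values are made up).
demo_frames = {
    "AvgAnnHrsWrk_ByAge": pd.DataFrame({"REF_DATE": [2013, 2014], "VALUE": [1.0, 2.0]}),
    "AvgAnnHrsWrk_ByGender": pd.DataFrame({"REF_DATE": [2013, 2014], "VALUE": [3.0, 4.0]}),
}

# A temporary folder is used here so the sketch does not touch real results.
demo_dir = os.path.join(tempfile.mkdtemp(), "Final_Result")
saved_paths = save_frames(demo_frames, demo_dir, "final_training_df_output_df")
print(saved_paths)
```

With a dictionary like this, adding or removing an indicator only changes the dictionary, not the saving code.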

In [185]:
# Backing up the modified testing dataset (numeric encoding) for the five provinces

file_testing_df_output_df_AvgAnnHrsWrk_ByAge = testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgAnnHrsWrk_ByEducation = testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgAnnHrsWrk_ByGender = testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgAnnHrsWrk_ByImmigrant = testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_testing_df_output_df_AvgAnnWages_ByAge = testing_df_AvgAnnWages_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgAnnWages_ByEducation = testing_df_AvgAnnWages_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgAnnWages_ByGender = testing_df_AvgAnnWages_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgAnnWages_ByImmigrant = testing_df_AvgAnnWages_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_testing_df_output_df_AvgHrsWages_ByAge = testing_df_AvgHrsWages_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgHrsWages_ByGender = testing_df_AvgHrsWages_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgHrsWages_ByEducation = testing_df_AvgHrsWages_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgHrsWages_ByImmigrant = testing_df_AvgHrsWages_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_testing_df_output_df_AvgWeekHrsWrked_ByAge = testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgWeekHrsWrked_ByEducation = testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgWeekHrsWrked_ByGender = testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_AvgWeekHrsWrked_ByImmigrant = testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_testing_df_output_df_Hrs_Wrked_ByAge = testing_df_Hrs_Wrked_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_Hrs_Wrked_ByEducation = testing_df_Hrs_Wrked_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_Hrs_Wrked_ByGender = testing_df_Hrs_Wrked_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_Hrs_Wrked_ByImmigrant = testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_testing_df_output_df_NumOfJob_ByAge = testing_df_NumOfJob_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_NumOfJob_ByEducation = testing_df_NumOfJob_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_NumOfJob_ByGender = testing_df_NumOfJob_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_NumOfJob_ByImmigrant = testing_df_NumOfJob_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()

file_testing_df_output_df_WagesAndSalaries_ByAge = testing_df_WagesAndSalaries_ByAge_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_WagesAndSalaries_ByEducation = testing_df_WagesAndSalaries_ByEducation_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_WagesAndSalaries_ByGender = testing_df_WagesAndSalaries_ByGender_FiveProvinces.output_One_Hot_Encoded_Data()
file_testing_df_output_df_WagesAndSalaries_ByImmigrant = testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces.output_One_Hot_Encoded_Data()
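The 28 variables above follow one naming pattern: seven indicators combined with four characteristic types. As a hedged sketch, the matching output paths could be generated rather than typed out. The function `testing_output_paths` is a hypothetical helper, and the name lists are copied from the variable names used in this notebook.

```python
from itertools import product

# Indicator and characteristic labels as they appear in the variable names.
INDICATORS = [
    "AvgAnnHrsWrk", "AvgAnnWages", "AvgHrsWages", "AvgWeekHrsWrked",
    "Hrs_Wrked", "NumOfJob", "WagesAndSalaries",
]
CHARACTERISTICS = ["ByAge", "ByEducation", "ByGender", "ByImmigrant"]


def testing_output_paths(out_dir="Final_Result"):
    """Map '<indicator>_<characteristic>' to the CSV path used for the testing set."""
    return {
        f"{ind}_{char}": f"{out_dir}/final_testing_df_output_df_{ind}_{char}.csv"
        for ind, char in product(INDICATORS, CHARACTERISTICS)
    }


paths = testing_output_paths()
print(len(paths))
print(paths["AvgAnnHrsWrk_ByAge"])
```

Generating the names this way keeps the indicator and characteristic order consistent across all groups.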

In [186]:
# Backing up original testing dataset with five Provinces
# Skipping saving into csv files for this.

# file_testing_df_output_df_AvgAnnHrsWrk_ByAge_Original = testing_df_AvgAnnHrsWrk_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgAnnHrsWrk_ByEducation_Original = testing_df_AvgAnnHrsWrk_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgAnnHrsWrk_ByGender_Original = testing_df_AvgAnnHrsWrk_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgAnnHrsWrk_ByImmigrant_Original = testing_df_AvgAnnHrsWrk_ByImmigrant_FiveProvinces.output_Original_Data()

# file_testing_df_output_df_AvgAnnWages_ByAge_Original = testing_df_AvgAnnWages_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgAnnWages_ByEducation_Original = testing_df_AvgAnnWages_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgAnnWages_ByGender_Original = testing_df_AvgAnnWages_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgAnnWages_ByImmigrant_Original = testing_df_AvgAnnWages_ByImmigrant_FiveProvinces.output_Original_Data()

# file_testing_df_output_df_AvgHrsWages_ByAge_Original = testing_df_AvgHrsWages_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgHrsWages_ByGender_Original = testing_df_AvgHrsWages_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgHrsWages_ByEducation_Original = testing_df_AvgHrsWages_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgHrsWages_ByImmigrant_Original = testing_df_AvgHrsWages_ByImmigrant_FiveProvinces.output_Original_Data()

# file_testing_df_output_df_AvgWeekHrsWrked_ByAge_Original = testing_df_AvgWeekHrsWrked_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgWeekHrsWrked_ByEducation_Original = testing_df_AvgWeekHrsWrked_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgWeekHrsWrked_ByGender_Original = testing_df_AvgWeekHrsWrked_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_AvgWeekHrsWrked_ByImmigrant_Original = testing_df_AvgWeekHrsWrked_ByImmigrant_FiveProvinces.output_Original_Data()

# file_testing_df_output_df_Hrs_Wrked_ByAge_Original = testing_df_Hrs_Wrked_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_Hrs_Wrked_ByEducation_Original = testing_df_Hrs_Wrked_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_Hrs_Wrked_ByGender_Original = testing_df_Hrs_Wrked_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_Hrs_Wrked_ByImmigrant_Original = testing_df_Hrs_Wrked_ByImmigrant_FiveProvinces.output_Original_Data()

# file_testing_df_output_df_NumOfJob_ByAge_Original = testing_df_NumOfJob_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_NumOfJob_ByEducation_Original = testing_df_NumOfJob_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_NumOfJob_ByGender_Original = testing_df_NumOfJob_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_NumOfJob_ByImmigrant_Original = testing_df_NumOfJob_ByImmigrant_FiveProvinces.output_Original_Data()

# file_testing_df_output_df_WagesAndSalaries_ByAge_Original = testing_df_WagesAndSalaries_ByAge_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_WagesAndSalaries_ByEducation_Original = testing_df_WagesAndSalaries_ByEducation_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_WagesAndSalaries_ByGender_Original = testing_df_WagesAndSalaries_ByGender_FiveProvinces.output_Original_Data()
# file_testing_df_output_df_WagesAndSalaries_ByImmigrant_Original = testing_df_WagesAndSalaries_ByImmigrant_FiveProvinces.output_Original_Data()

In [187]:
# Save each testing dataframe to its own CSV file in Final_Result/

file_testing_df_output_df_AvgAnnHrsWrk_ByAge.to_csv('Final_Result/final_testing_df_output_df_AvgAnnHrsWrk_ByAge.csv', index=False)
file_testing_df_output_df_AvgAnnHrsWrk_ByEducation.to_csv('Final_Result/final_testing_df_output_df_AvgAnnHrsWrk_ByEducation.csv', index=False)
file_testing_df_output_df_AvgAnnHrsWrk_ByGender.to_csv('Final_Result/final_testing_df_output_df_AvgAnnHrsWrk_ByGender.csv', index=False)
file_testing_df_output_df_AvgAnnHrsWrk_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_AvgAnnHrsWrk_ByImmigrant.csv', index=False)

file_testing_df_output_df_AvgAnnWages_ByAge.to_csv('Final_Result/final_testing_df_output_df_AvgAnnWages_ByAge.csv', index=False)
file_testing_df_output_df_AvgAnnWages_ByEducation.to_csv('Final_Result/final_testing_df_output_df_AvgAnnWages_ByEducation.csv', index=False)
file_testing_df_output_df_AvgAnnWages_ByGender.to_csv('Final_Result/final_testing_df_output_df_AvgAnnWages_ByGender.csv', index=False)
file_testing_df_output_df_AvgAnnWages_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_AvgAnnWages_ByImmigrant.csv', index=False)

file_testing_df_output_df_AvgHrsWages_ByAge.to_csv('Final_Result/final_testing_df_output_df_AvgHrsWages_ByAge.csv', index=False)
file_testing_df_output_df_AvgHrsWages_ByGender.to_csv('Final_Result/final_testing_df_output_df_AvgHrsWages_ByGender.csv', index=False)
file_testing_df_output_df_AvgHrsWages_ByEducation.to_csv('Final_Result/final_testing_df_output_df_AvgHrsWages_ByEducation.csv', index=False)
file_testing_df_output_df_AvgHrsWages_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_AvgHrsWages_ByImmigrant.csv', index=False)

file_testing_df_output_df_AvgWeekHrsWrked_ByAge.to_csv('Final_Result/final_testing_df_output_df_AvgWeekHrsWrked_ByAge.csv', index=False)
file_testing_df_output_df_AvgWeekHrsWrked_ByEducation.to_csv('Final_Result/final_testing_df_output_df_AvgWeekHrsWrked_ByEducation.csv', index=False)
file_testing_df_output_df_AvgWeekHrsWrked_ByGender.to_csv('Final_Result/final_testing_df_output_df_AvgWeekHrsWrked_ByGender.csv', index=False)
file_testing_df_output_df_AvgWeekHrsWrked_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_AvgWeekHrsWrked_ByImmigrant.csv', index=False)

file_testing_df_output_df_Hrs_Wrked_ByAge.to_csv('Final_Result/final_testing_df_output_df_Hrs_Wrked_ByAge.csv', index=False)
file_testing_df_output_df_Hrs_Wrked_ByEducation.to_csv('Final_Result/final_testing_df_output_df_Hrs_Wrked_ByEducation.csv', index=False)
file_testing_df_output_df_Hrs_Wrked_ByGender.to_csv('Final_Result/final_testing_df_output_df_Hrs_Wrked_ByGender.csv', index=False)
file_testing_df_output_df_Hrs_Wrked_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_Hrs_Wrked_ByImmigrant.csv', index=False)

file_testing_df_output_df_NumOfJob_ByAge.to_csv('Final_Result/final_testing_df_output_df_NumOfJob_ByAge.csv', index=False)
file_testing_df_output_df_NumOfJob_ByEducation.to_csv('Final_Result/final_testing_df_output_df_NumOfJob_ByEducation.csv', index=False)
file_testing_df_output_df_NumOfJob_ByGender.to_csv('Final_Result/final_testing_df_output_df_NumOfJob_ByGender.csv', index=False)
file_testing_df_output_df_NumOfJob_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_NumOfJob_ByImmigrant.csv', index=False)

file_testing_df_output_df_WagesAndSalaries_ByAge.to_csv('Final_Result/final_testing_df_output_df_WagesAndSalaries_ByAge.csv', index=False)
file_testing_df_output_df_WagesAndSalaries_ByEducation.to_csv('Final_Result/final_testing_df_output_df_WagesAndSalaries_ByEducation.csv', index=False)
file_testing_df_output_df_WagesAndSalaries_ByGender.to_csv('Final_Result/final_testing_df_output_df_WagesAndSalaries_ByGender.csv', index=False)
file_testing_df_output_df_WagesAndSalaries_ByImmigrant.to_csv('Final_Result/final_testing_df_output_df_WagesAndSalaries_ByImmigrant.csv', index=False)
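Before moving on to the next notebook, it may be worth confirming that a saved file reads back unchanged. This is a small sketch with a toy frame: the `VALUE` column and the file name are illustrative assumptions, and a temporary folder stands in for `Final_Result`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for one of the saved testing datasets.
original = pd.DataFrame(
    {"REF_DATE": [2019, 2020, 2021], "VALUE": [10.5, 11.0, 12.25]}
)

csv_path = os.path.join(tempfile.mkdtemp(), "final_testing_demo.csv")
original.to_csv(csv_path, index=False)

# Read the file back and check that columns, dtypes and values survived.
restored = pd.read_csv(csv_path)
pd.testing.assert_frame_equal(original, restored)
print(restored.shape)
```

Because the files are written with `index=False`, the row index is not stored, so the restored frame gets a fresh default index.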


### End of this section ###

This is the end of the analysis; the final results will be produced in a separate notebook.