# Data Exploration


## Loading and exploring the climate and CO2 data

In [2]:
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load temperature data from Kaggle
Let's load the Kaggle climate data first. The data was retrieved from [Kaggle](https://www.kaggle.com/datasets/sevgisarac/temperature-change/?select=FAOSTAT_data_1-10-2022.csv).
The reference period for the temperature here is also 1951–1980. For each country, the data gives the average temperature deviation per month and per year. Pretty straightforward.
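
To make "deviation relative to a reference period" concrete, here is a minimal sketch with made-up annual temperatures. The column names `year` and `temp_c` and all values are assumptions for illustration only. The Kaggle file already contains the deviations, so nothing like this needs to be run on it.

```python
import pandas as pd

# Made-up annual mean temperatures in °C (illustrative values only).
toy_temps = pd.DataFrame({
    "year": [1951, 1965, 1980, 2020],
    "temp_c": [14.0, 14.2, 14.1, 15.1],
})

# Baseline: mean temperature over the reference period 1951-1980 (inclusive).
in_reference = toy_temps["year"].between(1951, 1980)
baseline = toy_temps.loc[in_reference, "temp_c"].mean()

# Deviation (anomaly) = observed temperature minus the baseline.
toy_temps["anomaly"] = toy_temps["temp_c"] - baseline
print(toy_temps)
```

A positive value means the year was warmer than the 1951–1980 average, a negative value means it was colder.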

In [14]:
# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
tmp_df = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\Kaggle_data\FAOSTAT_data_1-10-2022.csv")
display(tmp_df.info())
tmp_df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229925 entries, 0 to 229924
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Domain Code       229925 non-null  object 
 1   Domain            229925 non-null  object 
 2   Area Code (FAO)   229925 non-null  int64  
 3   Area              229925 non-null  object 
 4   Element Code      229925 non-null  int64  
 5   Element           229925 non-null  object 
 6   Months Code       229925 non-null  int64  
 7   Months            229925 non-null  object 
 8   Year Code         229925 non-null  int64  
 9   Year              229925 non-null  int64  
 10  Unit              229925 non-null  object 
 11  Value             222012 non-null  float64
 12  Flag              229925 non-null  object 
 13  Flag Description  229925 non-null  object 
dtypes: float64(1), int64(5), object(8)
memory usage: 24.6+ MB


None

array(['Fc', 'NV'], dtype=object)

Okay, we can see that "Domain" is a useless column (it only says "Temperature change"), and the same goes for "Domain Code", "Element" and "Element Code".
About 7,900 temperature deviation values are missing (229,925 − 222,012 = 7,913). Careful: according to the Kaggle website, a couple of country area codes might be mismatched.
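
Both observations can also be checked programmatically rather than read off the `info()` output. The sketch below uses a small made-up frame (`toy_df`, with invented values) that mimics a few columns of the file. The same calls should work on `tmp_df`, but that has not been run here.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few columns of the FAOSTAT file (values are made up).
toy_df = pd.DataFrame({
    "Domain": ["Temperature change"] * 4,
    "Element": ["Temperature change"] * 4,
    "Area": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Year": [1961, 1962, 1961, 1962],
    "Value": [0.75, np.nan, 0.63, 0.33],
})

# Columns with a single distinct value carry no information.
n_unique = toy_df.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

# Number of missing values per column.
n_missing = toy_df.isna().sum()

# Drop the constant columns to get a slimmer frame.
slim_df = toy_df.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("missing values in 'Value':", n_missing["Value"])
print("remaining columns:", slim_df.columns.tolist())
```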

#### Kaggle Data set 2


In [5]:
# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
tmp_df1 = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\Kaggle_data\FAOSTAT_data_11-24-2020.csv")
display(tmp_df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321 entries, 0 to 320
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  321 non-null    int64  
 1   Country       321 non-null    object 
 2   M49 Code      304 non-null    float64
 3   ISO2 Code     245 non-null    object 
 4   ISO3 Code     257 non-null    object 
 5   Start Year    39 non-null     float64
 6   End Year      9 non-null      float64
dtypes: float64(3), int64(1), object(3)
memory usage: 17.7+ KB



None
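
This second file looks like a country code table, so it could be used to check the possibly mismatched area codes mentioned earlier. The sketch below is only an illustration on made-up frames. It assumes that "Country Code" in this file corresponds to "Area Code (FAO)" in the temperature file, which has not been verified, and the row with code 999 is invented to show what a mismatch looks like.

```python
import pandas as pd

# Toy stand-ins for the two files (made-up rows, hypothetical code 999).
toy_temps = pd.DataFrame({
    "Area Code (FAO)": [2, 2, 3, 999],
    "Area": ["Afghanistan", "Afghanistan", "Albania", "Hypothetical area"],
})
toy_codes = pd.DataFrame({
    "Country Code": [2, 3, 4],
    "Country": ["Afghanistan", "Albania", "Algeria"],
})

# One row per area, then a left join that records where each row came from.
areas = toy_temps[["Area Code (FAO)", "Area"]].drop_duplicates()
check = areas.merge(
    toy_codes,
    how="left",
    left_on="Area Code (FAO)",
    right_on="Country Code",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Area codes without a partner in the code table.
unmatched = check[check["_merge"] == "left_only"]

# Codes that match but whose names disagree.
name_differs = check[(check["_merge"] == "both") & (check["Area"] != check["Country"])]

print(unmatched[["Area Code (FAO)", "Area"]])
print(name_differs[["Area Code (FAO)", "Area", "Country"]])
```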

In [6]:
# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
tmp_df3 = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\Kaggle_data\Environment_Temperature_change_E_All_Data_NOFLAG.csv",
                      encoding='latin-1')
display(tmp_df3.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9656 entries, 0 to 9655
Data columns (total 66 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Area Code     9656 non-null   int64  
 1   Area          9656 non-null   object 
 2   Months Code   9656 non-null   int64  
 3   Months        9656 non-null   object 
 4   Element Code  9656 non-null   int64  
 5   Element       9656 non-null   object 
 6   Unit          9656 non-null   object 
 7   Y1961         8287 non-null   float64
 8   Y1962         8322 non-null   float64
 9   Y1963         8294 non-null   float64
 10  Y1964         8252 non-null   float64
 11  Y1965         8281 non-null   float64
 12  Y1966         8364 non-null   float64
 13  Y1967         8347 non-null   float64
 14  Y1968         8345 non-null   float64
 15  Y1969         8326 non-null   float64
 16  Y1970         8308 non-null   float64
 17  Y1971         8303 non-null   float64
 18  Y1972         8323 non-null 

None
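
Unlike the first file, this one stores the years as separate columns (`Y1961`, `Y1962`, ...), i.e. in wide format. If both files are to be compared, one option is to reshape the wide table into long format with `melt`. The sketch below uses a tiny made-up frame. The output column names `Year` and `Value` are chosen here to mirror the first file and are an assumption, not something the data prescribes.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame in the same wide layout (one column per year).
toy_wide = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Months": ["January", "February"],
    "Y1961": [0.5, -1.2],
    "Y1962": [0.1, np.nan],
})

# Year columns are those that look like "Y" followed by digits.
year_cols = [c for c in toy_wide.columns if c.startswith("Y") and c[1:].isdigit()]
id_cols = [c for c in toy_wide.columns if c not in year_cols]

# Wide -> long: one row per (Area, Months, Year).
toy_long = toy_wide.melt(
    id_vars=id_cols,
    value_vars=year_cols,
    var_name="Year",
    value_name="Value",
)

# Turn "Y1961" into the integer 1961.
toy_long["Year"] = toy_long["Year"].str[1:].astype(int)
print(toy_long)
```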

### Load CO2 data
The CO2 data was retrieved from the [GitHub page](https://github.com/owid/co2-data?tab=readme-ov-file) of "Our World in Data".

In [7]:
# raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
CO2_df = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\CO2_data\owid-co2-data.csv")
CO2_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47415 entries, 0 to 47414
Data columns (total 79 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   country                                    47415 non-null  object 
 1   year                                       47415 non-null  int64  
 2   iso_code                                   39548 non-null  object 
 3   population                                 39414 non-null  float64
 4   gdp                                        14495 non-null  float64
 5   cement_co2                                 23764 non-null  float64
 6   cement_co2_per_capita                      22017 non-null  float64
 7   co2                                        30308 non-null  float64
 8   co2_growth_abs                             28157 non-null  float64
 9   co2_growth_prct                            25136 non-null  float64
 10  co2_including_luc     

### Load the NASA data files
Lastly, we load the temperature anomaly data (deviation from the 1951–1980 reference period), which was retrieved from [NASA](https://data.giss.nasa.gov/gistemp/).