# Data Exploration


## Loading and exploring the climate and CO2 data

First, we take a look at all the data sets we were provided. We have data from Kaggle, Our World in Data, and NASA.

In [2]:
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load Kaggle temperature data
Let's load the Kaggle climate data first. The data was retrieved from [Kaggle](https://www.kaggle.com/datasets/sevgisarac/temperature-change/?select=FAOSTAT_data_1-10-2022.csv).
The reference period for the temperature is 1951–1980, the same as for the NASA data. For each country, the file gives the average temperature deviation per month and year. Pretty straightforward.

In [14]:
# raw string so that the backslashes in the Windows path are not read as escape sequences
tmp_df = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\Kaggle_data\FAOSTAT_data_1-10-2022.csv")
display(tmp_df.info())
tmp_df.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229925 entries, 0 to 229924
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Domain Code       229925 non-null  object 
 1   Domain            229925 non-null  object 
 2   Area Code (FAO)   229925 non-null  int64  
 3   Area              229925 non-null  object 
 4   Element Code      229925 non-null  int64  
 5   Element           229925 non-null  object 
 6   Months Code       229925 non-null  int64  
 7   Months            229925 non-null  object 
 8   Year Code         229925 non-null  int64  
 9   Year              229925 non-null  int64  
 10  Unit              229925 non-null  object 
 11  Value             222012 non-null  float64
 12  Flag              229925 non-null  object 
 13  Flag Description  229925 non-null  object 
dtypes: float64(1), int64(5), object(8)
memory usage: 24.6+ MB


None

array(['Fc', 'NV'], dtype=object)

Okay, we can see that "Domain" is a useless column (it only says temperature change), and the same goes for "Domain Code", "Element", and "Element Code".
We are missing 7,913 temperature deviation values (229,925 rows versus 222,012 non-null entries in "Value"). Careful: according to the Kaggle website, a couple of country area codes might be mismatched.
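As a minimal sketch of the cleanup described above, the cell below drops columns that hold a single distinct value and counts the missing temperature values. It runs on a small toy frame rather than on `tmp_df`; the names `toy_df`, `constant_cols`, and `slim_df` and all values in the frame are made up for illustration, only the column names come from the `info()` output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tmp_df: column names follow the info() output,
# every value (codes included) is made up for illustration.
toy_df = pd.DataFrame({
    "Domain Code": ["ET", "ET", "ET"],
    "Domain": ["Temperature change"] * 3,
    "Area": ["A", "A", "B"],
    "Element Code": [7271, 7271, 7271],
    "Element": ["Temperature change"] * 3,
    "Year": [1961, 1962, 1961],
    "Value": [0.1, np.nan, -0.2],
})

# Columns with a single distinct value carry no information for the analysis.
constant_cols = [col for col in toy_df.columns
                 if toy_df[col].nunique(dropna=False) == 1]
slim_df = toy_df.drop(columns=constant_cols)

# Number of missing temperature deviation values.
n_missing = int(slim_df["Value"].isna().sum())
print(constant_cols)
print(n_missing)
```

Checking `nunique` on the real frame first is probably safer than dropping the four columns by name, since it would show whether "Element" really holds only one value.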

#### Kaggle Data Set 2
This data set only contains information pertaining to the area code. For the analysis, we can safely ignore it.

In [16]:
# raw string so that the backslashes in the Windows path are not read as escape sequences
tmp_df1 = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\Kaggle_data\FAOSTAT_data_11-24-2020.csv")
display(tmp_df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321 entries, 0 to 320
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  321 non-null    int64  
 1   Country       321 non-null    object 
 2   M49 Code      304 non-null    float64
 3   ISO2 Code     245 non-null    object 
 4   ISO3 Code     257 non-null    object 
 5   Start Year    39 non-null     float64
 6   End Year      9 non-null      float64
dtypes: float64(3), int64(1), object(3)
memory usage: 17.7+ KB




None

#### Kaggle Data Set 3
This seems to be just a pivoted version of CSV file 1, with one column per year. Feel free to ignore it.

In [6]:
# raw string so that the backslashes in the Windows path are not read as escape sequences
tmp_df3 = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\Kaggle_data\Environment_Temperature_change_E_All_Data_NOFLAG.csv",
                      encoding='latin-1')
display(tmp_df3.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9656 entries, 0 to 9655
Data columns (total 66 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Area Code     9656 non-null   int64  
 1   Area          9656 non-null   object 
 2   Months Code   9656 non-null   int64  
 3   Months        9656 non-null   object 
 4   Element Code  9656 non-null   int64  
 5   Element       9656 non-null   object 
 6   Unit          9656 non-null   object 
 7   Y1961         8287 non-null   float64
 8   Y1962         8322 non-null   float64
 9   Y1963         8294 non-null   float64
 10  Y1964         8252 non-null   float64
 11  Y1965         8281 non-null   float64
 12  Y1966         8364 non-null   float64
 13  Y1967         8347 non-null   float64
 14  Y1968         8345 non-null   float64
 15  Y1969         8326 non-null   float64
 16  Y1970         8308 non-null   float64
 17  Y1971         8303 non-null   float64
 18  Y1972         8323 non-null 

None
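To illustrate what "pivoted" means for this file, here is a small sketch that turns the wide layout (one `Y1961`, `Y1962`, ... column per year) back into the long layout of the first Kaggle file. It uses a toy frame instead of `tmp_df3`; the names `wide_df` and `long_df` and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for tmp_df3: one column per year, values are made up.
wide_df = pd.DataFrame({
    "Area": ["A", "B"],
    "Months": ["January", "January"],
    "Y1961": [0.1, -0.2],
    "Y1962": [0.3, 0.0],
})

# Turn the Y-prefixed year columns into rows (wide -> long).
long_df = wide_df.melt(id_vars=["Area", "Months"],
                       var_name="Year", value_name="Value")

# Drop the leading "Y" so that Year becomes an integer.
long_df["Year"] = long_df["Year"].str[1:].astype(int)
print(long_df)
```

On the real file, the remaining non-year columns ("Area Code", "Months Code", "Element Code", "Element", "Unit") would also need to be listed in `id_vars`.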

### Load CO2 data
The CO2 data was retrieved from the [GitHub page](https://github.com/owid/co2-data?tab=readme-ov-file) of Our World in Data. It has a lot of interesting variables, but also a ton of missing data, which will be interesting to check out. The data is only available on a yearly rather than a monthly basis, which is also something we should have a look at.
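One possible way to put a number on the missing data is the share of missing values per column. The sketch below shows the idea on a toy frame; `toy_co2` and `missing_share` are hypothetical names and the values are made up, only the column names are taken from the OWID file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for CO2_df: column names follow the OWID file, values are made up.
toy_co2 = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [np.nan, np.nan, 1.0, np.nan],
    "co2": [5.0, np.nan, 2.0, 2.5],
})

# Share of missing values per column, largest first.
missing_share = toy_co2.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the full frame, the same two calls would rank all 79 columns by how incomplete they are.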

In [7]:
# raw string so that the backslashes in the Windows path are not read as escape sequences
CO2_df = pd.read_csv(filepath_or_buffer=r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\CO2_data\owid-co2-data.csv")
CO2_df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47415 entries, 0 to 47414
Data columns (total 79 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   country                                    47415 non-null  object 
 1   year                                       47415 non-null  int64  
 2   iso_code                                   39548 non-null  object 
 3   population                                 39414 non-null  float64
 4   gdp                                        14495 non-null  float64
 5   cement_co2                                 23764 non-null  float64
 6   cement_co2_per_capita                      22017 non-null  float64
 7   co2                                        30308 non-null  float64
 8   co2_growth_abs                             28157 non-null  float64
 9   co2_growth_prct                            25136 non-null  float64
 10  co2_including_luc     

### Load the NASA data files
Lastly, we load the temperature anomaly data (deviation from the 1951–1980 reference period), which was retrieved from [NASA](https://data.giss.nasa.gov/gistemp/). First, we load the global mean data:

In [23]:
Global_mean_temp_df = pd.read_csv(filepath_or_buffer = r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\NASA\GLB.Ts+dSST.csv")
Global_mean_temp_df.head(5)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Land-Ocean: Global Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.18,-.24,-.09,-.16,-.10,-.21,-.18,-.10,-.14,-.23,-.22,-.17,-.17,***,***,-.11,-.16,-.20
1881,-.20,-.14,.03,.05,.06,-.19,.00,-.04,-.15,-.22,-.19,-.07,-.09,-.10,-.17,.05,-.08,-.19
1882,.16,.14,.05,-.17,-.15,-.24,-.17,-.08,-.14,-.24,-.16,-.36,-.11,-.09,.08,-.09,-.16,-.18
1883,-.29,-.37,-.12,-.18,-.18,-.08,-.07,-.14,-.21,-.11,-.23,-.11,-.17,-.20,-.34,-.16,-.10,-.19


From what I can gather, J-D is the yearly mean from January to December, whereas D-N is the yearly mean from December to November. DJF is short for the mean of December, January, and February, and so forth. The only missing values we see here (shown as `***`) occur where a month needed for the mean is not in the data set, as for D-N and DJF in 1880, which would need December 1879.
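The `Unnamed:` columns in the output above suggest that the first line of the file is a title rather than the header. A hedged sketch of how the file could be read so that the month names become the column names and `***` becomes `NaN` is shown below. It parses a few lines copied from the output via `io.StringIO` instead of the real file; `raw` and `sample_df` are names introduced for the example, and the assumption is that the title line looks exactly as it does in the output.

```python
import io

import pandas as pd

# First lines of the global file as they appear in the output above:
# a title line, then the real header, then the data rows.
raw = """Land-Ocean: Global Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.18,-.24,-.09,-.16,-.10,-.21,-.18,-.10,-.14,-.23,-.22,-.17,-.17,***,***,-.11,-.16,-.20
1881,-.20,-.14,.03,.05,.06,-.19,.00,-.04,-.15,-.22,-.19,-.07,-.09,-.10,-.17,.05,-.08,-.19
"""

# skiprows=1 skips the title line; na_values turns "***" into NaN,
# so the affected columns are parsed as floats instead of strings.
sample_df = pd.read_csv(io.StringIO(raw), skiprows=1, na_values="***")
print(sample_df[["Year", "J-D", "D-N", "DJF"]])
```

The same two arguments should carry over to the hemispheric files, since their output has the same shape.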

#### NASA Northern Hemisphere mean data
Same thing as the global data, but for the Northern Hemisphere.

In [25]:
Northern_mean_temp_df = pd.read_csv(filepath_or_buffer = r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\NASA\NH.Ts+dSST.csv")
Northern_mean_temp_df.head(5)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Land-Ocean: Northern Hemispheric Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.35,-.51,-.22,-.30,-.06,-.15,-.18,-.26,-.23,-.32,-.42,-.40,-.28,***,***,-.19,-.20,-.32
1881,-.31,-.23,-.04,.00,.03,-.34,.07,-.05,-.27,-.44,-.37,-.24,-.18,-.20,-.31,.00,-.11,-.36
1882,.26,.21,.02,-.32,-.25,-.31,-.29,-.15,-.25,-.52,-.33,-.67,-.22,-.18,.08,-.18,-.25,-.37
1883,-.57,-.66,-.15,-.29,-.25,-.13,-.04,-.22,-.33,-.16,-.42,-.15,-.28,-.32,-.64,-.23,-.13,-.30


#### NASA Southern Hemisphere mean data

In [26]:
Southern_mean_temp_df = pd.read_csv(filepath_or_buffer = r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\NASA\SH.Ts+dSST.csv")
Southern_mean_temp_df.head(5)


Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Land-Ocean: Southern Hemispheric Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.01,.03,.05,-.02,-.13,-.25,-.18,.06,-.05,-.15,-.01,.05,-.05,***,***,-.04,-.12,-.07
1881,-.09,-.07,.09,.09,.08,-.06,-.08,-.03,-.04,-.01,-.01,.09,.00,-.01,-.04,.09,-.06,-.02
1882,.06,.07,.07,-.02,-.05,-.16,-.05,.00,-.04,.03,-.02,-.08,-.01,.00,.07,.00,-.07,-.01
1883,-.03,-.09,-.09,-.07,-.10,-.03,-.09,-.06,-.10,-.06,-.05,-.06,-.07,-.07,-.06,-.09,-.06,-.07


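#### Zonal Annual Means
Annual mean Land-Ocean Temperature Index in 0.01 degrees Celsius for selected zonal means, although the values shown below look like they are already in degrees Celsius (`Glob` for 1880 is -0.17, the same as J-D in the global file). The columns correspond to latitude bands next to the global and hemispheric means: for example, `24N-90N` runs from 24°N to 90°N, and `EQU` is the equator. FYI, no trouble with missing values.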

In [31]:
Zone_mean_temp_df = pd.read_csv(filepath_or_buffer = r"D:\Data\Dropbox\LifeAfter\Datascientest\Climate\Data\NASA\ZonAnn.Ts+dSST.csv")
Zone_mean_temp_df.head(5)

Unnamed: 0,Year,Glob,NHem,SHem,24N-90N,24S-24N,90S-24S,64N-90N,44N-64N,24N-44N,EQU-24N,24S-EQU,44S-24S,64S-44S,90S-64S
0,1880,-0.17,-0.28,-0.05,-0.37,-0.13,-0.02,-0.82,-0.47,-0.28,-0.15,-0.11,-0.05,0.05,0.67
1,1881,-0.09,-0.18,0.0,-0.35,0.1,-0.07,-0.94,-0.46,-0.2,0.1,0.1,-0.06,-0.07,0.6
2,1882,-0.11,-0.22,-0.01,-0.32,-0.05,0.01,-1.43,-0.29,-0.14,-0.05,-0.05,0.01,0.04,0.63
3,1883,-0.17,-0.28,-0.07,-0.34,-0.17,-0.02,-0.19,-0.57,-0.24,-0.18,-0.16,-0.04,0.07,0.5
4,1884,-0.29,-0.43,-0.15,-0.61,-0.15,-0.14,-1.32,-0.65,-0.46,-0.13,-0.17,-0.2,-0.02,0.65
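To make the zone columns easier to work with, the labels can be split into numeric latitude bounds. The sketch below is one way to do that; `parse_latitude` and `parse_band` are hypothetical helpers written for this example (they are not part of pandas or of the NASA files), and they assume that every label has the form seen in the header above.

```python
from typing import Tuple


def parse_latitude(token: str) -> float:
    """Convert '24N', '90S' or 'EQU' to degrees latitude (south is negative)."""
    if token == "EQU":
        return 0.0
    degrees = float(token[:-1])
    return degrees if token[-1] == "N" else -degrees


def parse_band(label: str) -> Tuple[float, float]:
    """Split a zone label such as '24N-90N' into (first edge, second edge)."""
    start, end = label.split("-")
    return parse_latitude(start), parse_latitude(end)


# Zone labels taken from the header of the zonal file.
zone_labels = ["24N-90N", "24S-24N", "90S-24S", "EQU-24N", "24S-EQU"]
bands = {label: parse_band(label) for label in zone_labels}
print(bands)
```

The resulting bounds could be used, for instance, to order the zone columns from south to north before plotting.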
