# Exploratory Data Analysis (EDA) example

This notebook walks through some of the basic analyses you would typically run during EDA to get to know your data. In this example, we use daily temperature data for major global cities, which can be downloaded from:
https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities

The data were originally sourced from the National Climatic Data Center and compiled by the University of Dayton. More information, including data descriptions and documentation, is available here:
https://academic.udayton.edu/kissock/http/Weather/default.htm

I've added the `city_temperature.csv` file to the `data` folder in the course repository.

In [3]:
# load in the packages you will need for your analysis

import pandas as pd  # pandas is a package for data analysis and manipulation
# import seaborn as sns  # seaborn is a data visualization package (not used below)

In [4]:
# load in the csv file with cities temperatures

# set the path to the correct directory for your particular computer
infile = '../data/city_temperature.csv'

data = pd.read_csv(infile, sep = ',')


In [6]:
# useful functions for exploring data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906327 entries, 0 to 2906326
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Region          object 
 1   Country         object 
 2   State           object 
 3   City            object 
 4   Month           int64  
 5   Day             int64  
 6   Year            int64  
 7   AvgTemperature  float64
dtypes: float64(1), int64(3), object(4)
memory usage: 177.4+ MB
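
The `+` in the memory usage line means pandas has not measured the text stored in the string columns. The sketch below shows one way to get a fuller figure with `memory_usage(deep=True)` and to shrink repetitive text columns by converting them to categoricals. The `toy` frame and its values are made up for illustration; only the column names come from the real data.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'Region': ['Europe', 'Europe', 'Africa', 'Africa'] * 250,
    'City': ['Vienna', 'Tirana', 'Algiers', 'Lusaka'] * 250,
    'AvgTemperature': [50.0, 55.5, 70.1, 68.3] * 250,
})

# deep=True also counts the text stored in the string columns
bytes_before = toy.memory_usage(deep=True).sum()

# Columns with few distinct values can be stored as categoricals
compact = toy.astype({'Region': 'category', 'City': 'category'})
bytes_after = compact.memory_usage(deep=True).sum()

print(bytes_before, bytes_after)
```

Whether this is worth doing on the full dataset depends on how many distinct values each column has; `Region`, `Country` and `City` are plausible candidates.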


In [10]:
data.shape

(2906327, 8)

In [11]:
data.size

23250616

In [12]:
data.columns

Index(['Region', 'Country', 'State', 'City', 'Month', 'Day', 'Year',
       'AvgTemperature'],
      dtype='object')

In [13]:
# descriptive statistics

data.describe()

           Month        Day       Year  AvgTemperature
count  2906327.0  2906327.0  2906327.0       2906327.0
mean    6.469163   15.71682   2006.624        56.00492
std     3.456489   8.800534   23.38226        32.12359
min          1.0        0.0      200.0           -99.0
25%          3.0        8.0     2001.0            45.8
50%          6.0       16.0     2007.0            62.5
75%          9.0       23.0     2013.0            75.5
max         12.0       31.0     2020.0           110.0
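
The minimum values in this summary look suspicious: a `Day` of 0, a `Year` of 200 and an `AvgTemperature` of -99. The -99 is most likely a no-data flag, but that is an assumption worth confirming in the source documentation. A minimal sketch of how such rows could be counted and filtered is shown below, using a made-up `toy` frame and a hypothetical cutoff of 1900 for implausible years.

```python
import pandas as pd

# Toy rows mimicking the real columns (values are made up)
toy = pd.DataFrame({
    'Month': [1, 1, 2, 12],
    'Day': [1, 0, 15, 31],
    'Year': [1995, 200, 2007, 2020],
    'AvgTemperature': [64.2, -99.0, 45.8, -99.0],
})

# Assumed checks: -99 as a no-data flag, Day 0 and very early years as bad dates
suspect = pd.DataFrame({
    'missing_temp': toy['AvgTemperature'] == -99,
    'bad_day': toy['Day'] < 1,
    'bad_year': toy['Year'] < 1900,  # hypothetical cutoff
})

# Number of rows failing each check
counts = suspect.sum()
print(counts)

# Keep only the rows that pass every check
clean = toy[~suspect.any(axis=1)]
print(len(clean))
```

Counting before dropping is a deliberate choice here: it shows how much data each rule would remove before you commit to it.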


In [14]:
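# correlation between the numeric columns only
# (recent pandas versions raise an error on text columns without numeric_only=True)
data.corr(numeric_only=True)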

                    Month        Day       Year  AvgTemperature
Month                 1.0   0.011209  -0.026898        0.075037
Day              0.011209        1.0  -0.002213          0.0001
Year            -0.026898  -0.002213        1.0        0.087245
AvgTemperature   0.075037     0.0001   0.087245             1.0


In [21]:
# store the unique city names, in order of first appearance
cities = data['City'].unique()

In [24]:
print(cities[99])

Bucharest
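
Once you have a city name, a common next step is to pull out just that city's rows. The sketch below does this with a boolean mask on a small made-up `toy` frame; the temperatures are invented and only the column names match the real data.

```python
import pandas as pd

# Toy rows (values are made up for illustration)
toy = pd.DataFrame({
    'City': ['Bucharest', 'Vienna', 'Bucharest', 'Tirana'],
    'Year': [1995, 1995, 1996, 1995],
    'AvgTemperature': [30.0, 28.0, 34.0, 45.0],
})

cities = toy['City'].unique()  # distinct names in order of first appearance
city = cities[0]               # 'Bucharest' in this toy frame

# The boolean mask keeps only the rows for the chosen city
subset = toy[toy['City'] == city]
mean_temp = subset['AvgTemperature'].mean()

print(city, len(subset), mean_temp)
```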


In [14]:
# load in the data using pandas

infile = '../data/city_temperature.csv'
data = pd.read_csv(infile, sep = ',')

# convert from F to C
data['AvgTemperatureC'] = (data.AvgTemperature - 32) * 5/9

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906327 entries, 0 to 2906326
Data columns (total 9 columns):
 #   Column           Dtype  
---  ------           -----  
 0   Region           object 
 1   Country          object 
 2   State            object 
 3   City             object 
 4   Month            int64  
 5   Day              int64  
 6   Year             int64  
 7   AvgTemperature   float64
 8   AvgTemperatureC  float64
dtypes: float64(2), int64(3), object(4)
memory usage: 199.6+ MB


In [15]:
data.describe()

           Month        Day       Year  AvgTemperature  AvgTemperatureC
count  2906327.0  2906327.0  2906327.0       2906327.0        2906327.0
mean    6.469163   15.71682   2006.624        56.00492         13.33607
std     3.456489   8.800534   23.38226        32.12359         17.84644
min          1.0        0.0      200.0           -99.0        -72.77778
25%          3.0        8.0     2001.0            45.8         7.666667
50%          6.0       16.0     2007.0            62.5         16.94444
75%          9.0       23.0     2013.0            75.5         24.16667
max         12.0       31.0     2020.0           110.0         43.33333
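
Note that the minimum of -72.77778 in `AvgTemperatureC` is just -99 pushed through the conversion formula, so it is probably not a real temperature. Assuming -99 marks a missing reading (this should be checked against the source documentation), one option is to replace it with `NaN` before converting. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy readings in Fahrenheit (values are made up); -99 is the assumed no-data flag
toy = pd.DataFrame({'AvgTemperature': [32.0, 212.0, -99.0]})

# Turn the flag into NaN so it is skipped by mean(), describe(), etc.
temp_f = toy['AvgTemperature'].replace(-99, np.nan)

# convert from F to C
toy['AvgTemperatureC'] = (temp_f - 32) * 5 / 9

print(toy)
```

The original `AvgTemperature` column is left untouched here, so the raw values remain available for comparison.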


In [25]:
data.groupby('Country')['City'].unique()

Country
Albania                                             [Tirana]
Algeria                                            [Algiers]
Argentina                                     [Buenos Aires]
Australia     [Brisbane, Canberra, Melbourne, Perth, Sydney]
Austria                                             [Vienna]
                                   ...                      
Uzbekistan                                        [Tashkent]
Venezuela                                          [Caracas]
Vietnam                                              [Hanoi]
Yugoslavia                                        [Belgrade]
Zambia                                              [Lusaka]
Name: City, Length: 125, dtype: object
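
If you only need to know how many cities each country has, rather than their names, `nunique()` gives a more compact summary than `unique()`. The sketch below uses a made-up `toy` frame with the same column names.

```python
import pandas as pd

# Toy rows (made up); Sydney appears twice to show that duplicates are not double-counted
toy = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Australia', 'Austria'],
    'City': ['Sydney', 'Perth', 'Sydney', 'Vienna'],
})

# Number of distinct cities per country, largest first
cities_per_country = (
    toy.groupby('Country')['City']
       .nunique()
       .sort_values(ascending=False)
)

print(cities_per_country)
```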