# Exploratory Data Analysis (EDA) example

This notebook runs through some of the basic analyses you would typically do during EDA to get to know your data. In this example, we use daily temperature data for major cities around the world, which can be downloaded from Kaggle:
https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities

The data were originally sourced from the National Climatic Data Center and compiled by the University of Dayton. Data descriptions and further documentation can be found here:
https://academic.udayton.edu/kissock/http/Weather/default.htm

I've added the `city_temperature.csv` file to the `data` folder in the course repository.

In [1]:
# load in the packages you will need for your analysis

import pandas as pd  # pandas is a package for data analysis and manipulation
import seaborn as sn  # seaborn is a data visualization package

In [8]:
# load in the csv file with the city temperatures

# set the path to the correct directory for your particular computer
infile = '../data/city_temperature.csv'

# read in the .csv file with the column separator explicitly defined
data = pd.read_csv(infile, sep = ',')
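
With a file of this size, pandas infers column types as it reads and can warn about mixed types in a column (for example, a text column such as `State` that is empty for most rows). Below is a minimal sketch of one way to avoid that by declaring the text columns up front. It assumes the column names reported by `data.info()` in this notebook; the small in-memory CSV is made up for illustration and only stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for city_temperature.csv
sample_csv = io.StringIO(
    "Region,Country,State,City,Month,Day,Year,AvgTemperature\n"
    "Africa,Algeria,,Algiers,1,1,1995,64.2\n"
    "North America,US,Alabama,Birmingham,1,1,1995,50.7\n"
    "Europe,Spain,,Madrid,1,1,1995,44.1\n"
)

# Declare the text columns so pandas does not have to guess their type
text_columns = {'Region': str, 'Country': str, 'State': str, 'City': str}
sample = pd.read_csv(sample_csv, sep=',', dtype=text_columns)

print(sample.dtypes)
print(sample['State'].isna().sum())  # empty State fields are read as missing
```

The same `dtype` argument could be passed when reading `infile`; whether it is needed depends on what pandas reports for your copy of the data.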

In [None]:
# convert the temperature data from F to C
# assign to a new column in the dataframe called 'AvgTemperatureC'
data['AvgTemperatureC'] = (data.AvgTemperature - 32) * 5/9
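
As a quick sanity check on the conversion formula, the sketch below applies the same expression to a few hand-picked Fahrenheit values whose Celsius equivalents are well known. The small `check` dataframe is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical values: freezing point, boiling point, and the point where both scales agree
check = pd.DataFrame({'AvgTemperature': [32.0, 212.0, -40.0]})

# Same formula as above: C = (F - 32) * 5/9
check['AvgTemperatureC'] = (check['AvgTemperature'] - 32) * 5 / 9

print(check)
```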

In [9]:
# print the information about the data in the dataframe

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906327 entries, 0 to 2906326
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Region          object 
 1   Country         object 
 2   State           object 
 3   City            object 
 4   Month           int64  
 5   Day             int64  
 6   Year            int64  
 7   AvgTemperature  float64
dtypes: float64(1), int64(3), object(4)
memory usage: 177.4+ MB


In [10]:
# the shape of the data
# nrows, ncols

data.shape

(2906327, 8)

In [11]:
# total number of datapoints in the dataframe
# ncol * nrows

data.size

23250616

In [12]:
# returns the column names in the dataframe (as a pandas Index)

data.columns

Index(['Region', 'Country', 'State', 'City', 'Month', 'Day', 'Year',
       'AvgTemperature'],
      dtype='object')

In [13]:
# descriptive statistics for the numeric data in the dataframe

data.describe()

           Month        Day       Year  AvgTemperature
count  2906327.0  2906327.0  2906327.0       2906327.0
mean    6.469163   15.71682   2006.624        56.00492
std     3.456489   8.800534   23.38226        32.12359
min          1.0        0.0      200.0           -99.0
25%          3.0        8.0     2001.0            45.8
50%          6.0       16.0     2007.0            62.5
75%          9.0       23.0     2013.0            75.5
max         12.0       31.0     2020.0           110.0
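
The minimum values in this summary stand out: an `AvgTemperature` of -99, a `Day` of 0, and a `Year` of 200 are not plausible measurements or dates. Below is a minimal sketch of one way to handle them. It assumes that -99 is a placeholder for a missing reading and that a day of 0 or a year far outside the rest of the data is a data-entry error; both are assumptions, so check the data documentation before relying on them. The small dataframe and the `min_valid_year` cutoff are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the suspicious minimum values in the summary
toy = pd.DataFrame({
    'City': ['Madrid', 'Madrid', 'Madrid', 'Madrid'],
    'Month': [1, 1, 1, 1],
    'Day': [1, 2, 0, 3],
    'Year': [1995, 1995, 1995, 200],
    'AvgTemperature': [44.1, -99.0, 45.0, 46.3],
})

# Assumption: -99 marks a missing reading, so treat it as NaN rather than a temperature
toy['AvgTemperature'] = toy['AvgTemperature'].replace(-99.0, np.nan)

# Assumption: a day of 0 or a year before this cutoff cannot be a real date
min_valid_year = 1900  # hypothetical cutoff
valid_date = (toy['Day'] >= 1) & (toy['Year'] >= min_valid_year)
clean = toy[valid_date]

print(clean)
print(clean['AvgTemperature'].mean())  # NaN values are skipped by default
```

Replacing the placeholder with `NaN` before converting units or computing statistics keeps it from being averaged in as if it were a real temperature.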


In [14]:
# correlations between the different numeric data columns in the dataframe
# numeric_only=True skips the text columns (needed in pandas 2.0 and later)

data.corr(numeric_only=True)

                    Month        Day       Year  AvgTemperature
Month                 1.0   0.011209  -0.026898        0.075037
Day              0.011209        1.0  -0.002213          0.0001
Year            -0.026898  -0.002213        1.0        0.087245
AvgTemperature   0.075037     0.0001   0.087245             1.0


In [15]:
# get the unique values found in the 'City' column of the dataframe
# and assign them to a new variable called 'cities' (a NumPy array)
cities = data['City'].unique()

In [16]:
print(cities[99:112])

['Bucharest' 'Moscow' 'Yerevan' 'Pristina' 'Bratislava' 'Barcelona'
 'Bilbao' 'Madrid' 'Stockholm' 'Bern' 'Geneva' 'Zurich' 'Kiev']


In [25]:
# Group the data by a particular column
# In this case, the data is grouped by country, and I am only interested in the 'City' column

data.groupby('Country')['City'].unique()

Country
Albania                                             [Tirana]
Algeria                                            [Algiers]
Argentina                                     [Buenos Aires]
Australia     [Brisbane, Canberra, Melbourne, Perth, Sydney]
Austria                                             [Vienna]
                                   ...                      
Uzbekistan                                        [Tashkent]
Venezuela                                          [Caracas]
Vietnam                                              [Hanoi]
Yugoslavia                                        [Belgrade]
Zambia                                              [Lusaka]
Name: City, Length: 125, dtype: object
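
A natural follow-up is to count how many distinct cities each country contributes. Below is a minimal sketch that uses `nunique()` on the same kind of grouping. The small dataframe is made up for illustration and only borrows a few country and city names from the output above.

```python
import pandas as pd

# Hypothetical rows; repeated cities mimic multiple daily records per city
toy = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Australia', 'Austria', 'Austria'],
    'City': ['Perth', 'Sydney', 'Perth', 'Vienna', 'Vienna'],
})

# Number of distinct cities per country, largest first
cities_per_country = (
    toy.groupby('Country')['City']
    .nunique()
    .sort_values(ascending=False)
)

print(cities_per_country)
```

Applied to `data`, the same chain would give one count per country instead of the arrays of city names shown above.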