# DSC350 - Week 3 - Exercise 3.2

We begin by importing the libraries needed for this week's exercises.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from pandas.io.parsers import read_csv
import quandl
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint

## Hands-On Data Analysis (2nd Edition) page 111, Exercises 1-6

Using the data/parsed.csv file and the materials from this chapter, complete the following exercises to practice your *pandas* skills.

**1. Find the 95th percentile of earthquake magnitude in Japan using the *mb* magnitude type.**

In [2]:
# Load the csv file as a dataframe
df = pd.read_csv(r'C:\Users\thefli0\Downloads\parsed.csv')

# Filter the created dataframe for entries in Japan and 'mb' type
japan_mb_df = df[(df['parsed_place'] == 'Japan') & (df['magType'] == 'mb')].copy()

# Convert the 'mag' column to numeric, coercing errors to NaN
japan_mb_df['mag'] = pd.to_numeric(japan_mb_df['mag'], errors='coerce')
# Drop rows with NaN values
japan_mb_df = japan_mb_df.dropna(subset=['mag'])

# Calculate 95th percentile
percentile_95 = japan_mb_df['mag'].quantile(0.95)

# Display the result
print(f"The 95th percentile of earthquake magnitudes in Japan using the 'mb' magnitude type is {percentile_95:.2f}")

The 95th percentile of earthquake magnitudes in Japan using the 'mb' magnitude type is 4.90
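
As a small aside on how `quantile()` arrives at a value: by default pandas interpolates linearly between the two nearest data points. The sketch below uses made-up magnitudes (not values from parsed.csv) to show the default next to the `'nearest'` option.

```python
import pandas as pd

# Hypothetical magnitudes for illustration only (not from parsed.csv)
mags = pd.Series([4.0, 4.2, 4.4, 4.6, 4.8])

# Default: linear interpolation between the two closest observations
p95 = mags.quantile(0.95)

# Alternative: return the closest actual observation instead
p95_nearest = mags.quantile(0.95, interpolation='nearest')

print(f"linear: {p95:.2f}, nearest: {p95_nearest:.2f}")
```

With five values, the 0.95 position falls 80% of the way between the fourth and fifth observations, so the linear result sits between 4.6 and 4.8.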


**2. Find the percentage of earthquakes in Indonesia that were coupled with tsunamis.**

In [3]:
# Load the csv file as a dataframe
df = pd.read_csv(r'C:\Users\thefli0\Downloads\parsed.csv')

# Filter the created dataframe for entries in Indonesia
indonesia_df = df[df['parsed_place'].str.contains('Indonesia', na=False)]

# Count the total number of earthquakes in Indonesia
total_earthquakes = len(indonesia_df)

# Filter entries where tsunami equals 1 (indicating a tsunami occurred)
tsunami_events = len(indonesia_df[indonesia_df['tsunami'] == 1])

# Calculate percentage of earthquakes with tsunamis
if total_earthquakes > 0:
    tsunami_percentage = (tsunami_events / total_earthquakes) * 100
else:
    tsunami_percentage = 0

# Display results
print(f"The percentage of earthquakes in Indonesia that also had a tsunami is {tsunami_percentage:.2f}%")

The percentage of earthquakes in Indonesia that also had a tsunami is 23.13%
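
The same percentage can also be computed as the mean of a Boolean column, since `True` counts as 1 and `False` as 0. This is a minimal sketch on made-up rows (not from parsed.csv); the frame name `quakes` is hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only (not from parsed.csv)
quakes = pd.DataFrame({
    'parsed_place': ['Indonesia', 'Indonesia', 'Japan', 'Indonesia', 'Indonesia'],
    'tsunami': [1, 0, 1, 0, 0],
})

# Keep only the Indonesia rows
indonesia = quakes[quakes['parsed_place'] == 'Indonesia']

# The mean of a Boolean Series is the fraction of True values
pct = indonesia['tsunami'].eq(1).mean() * 100

print(f"{pct:.2f}%")
```

Note that the mean of an empty selection is NaN rather than 0, so the explicit `if total_earthquakes > 0` guard is still useful when the filter might match nothing.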


**3. Calculate summary statistics for earthquakes in Nevada.**

In [4]:
# Load the csv file as a dataframe
df = pd.read_csv(r'C:\Users\thefli0\Downloads\parsed.csv')

# Filter the created dataframe for entries with 'Nevada' and 'earthquake'
nevada_earthquakes_df = df[(df['type'] == 'earthquake') & (df['parsed_place'].str.contains('Nevada', na=False))]

# Generate summary statistics for filtered dataframe
summary_statistics = nevada_earthquakes_df.describe()

# Display results
print("Summary Statistics for Earthquakes in Nevada:")
print(summary_statistics)

Summary Statistics for Earthquakes in Nevada:
             cdi        dmin       felt         gap         mag   mmi  \
count  14.000000  647.000000  14.000000  647.000000  647.000000  1.00   
mean    2.421429    0.163155   2.500000  154.436615    0.437311  2.84   
std     0.514675    0.161793   4.783787   69.474945    0.653397   NaN   
min     2.000000    0.001000   1.000000   29.140000   -0.500000  2.84   
25%     2.000000    0.053000   1.000000   97.295000   -0.100000  2.84   
50%     2.200000    0.109000   1.000000  150.040000    0.300000  2.84   
75%     3.000000    0.223000   1.000000  200.515000    0.800000  2.84   
max     3.300000    1.414000  19.000000  355.910000    2.900000  2.84   

              nst         rms         sig          time  tsunami     tz  \
count  647.000000  647.000000  647.000000  6.470000e+02    647.0  647.0   
mean    12.704791    0.140627    9.146832  1.538318e+12      0.0 -480.0   
std     10.052695    0.056765   17.939055  5.954980e+08      0.0    0.0

**4. Add a column indicating whether the earthquake happened in a country or US state that is on the Ring of Fire. Use Alaska, Antarctica (look for Antarctic), Bolivia, California, Canada, Chile, Costa Rica, Ecuador, Fiji, Guatemala, Indonesia, Japan, Kermadec Islands, Mexico (be careful not to select New Mexico), New Zealand, Peru, Philippines, Russia, Taiwan, Tonga, and Washington.**

In [5]:
# Load the csv file as a dataframe
df = pd.read_csv(r'C:\Users\thefli0\Downloads\parsed.csv')

# Define list of "Ring of Fire" locations
ring_of_fire_locations = ['Alaska', 'Pacific-Antartic Ridge', 'Western Indian-Arctic Ridge',
                         'Bolivia', 'California', 'Canada', 'Chile', 'Costa Rica', 'Ecuador',
                         'Fiji', 'Guatemala', 'Indonesia', 'Japan', 'Kermadec Islands', 'Mexico', 
                         'New Zealand', 'Peru', 'Philippines', 'Russia', 'Taiwan', 'Tonga',
                         'Washington']

# Create new column with default value of 0
df['Ring of Fire'] = 0

# Set 'Ring of Fire' column to 1 for rows that meet specified criteria
df.loc[(df['type'] == 'earthquake') & (df['parsed_place'].isin(ring_of_fire_locations)), 'Ring of Fire'] = 1

# Save to new, updated file
df.to_csv(r'C:\Users\thefli0\Downloads\updated_parsed.csv', index=False)

# Display first few rows of updated dataframe
print(df.head())

  alert  cdi      code                                             detail  \
0   NaN  NaN  37389218  https://earthquake.usgs.gov/fdsnws/event/1/que...   
1   NaN  NaN  37389202  https://earthquake.usgs.gov/fdsnws/event/1/que...   
2   NaN  4.4  37389194  https://earthquake.usgs.gov/fdsnws/event/1/que...   
3   NaN  NaN  37389186  https://earthquake.usgs.gov/fdsnws/event/1/que...   
4   NaN  NaN  73096941  https://earthquake.usgs.gov/fdsnws/event/1/que...   

       dmin  felt    gap           ids   mag magType  ...           time  \
0  0.008693   NaN   85.0  ,ci37389218,  1.35      ml  ...  1539475168010   
1  0.020030   NaN   79.0  ,ci37389202,  1.29      ml  ...  1539475129610   
2  0.021370  28.0   21.0  ,ci37389194,  3.42      ml  ...  1539475062610   
3  0.026180   NaN   39.0  ,ci37389186,  0.44      ml  ...  1539474978070   
4  0.077990   NaN  192.0  ,nc73096941,  2.16      md  ...  1539474716050   

                           title  tsunami        type  \
0  M 1.4 - 9km NE of Ag
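
The exercise hints ("look for Antarctic", "be careful not to select New Mexico") suggest substring matching rather than exact matching. One possible approach, sketched here on made-up place names and only a few of the listed locations, is `str.contains` with a regular expression that uses a negative lookbehind to skip "New Mexico". The series `places` and the shortened pattern are assumptions for illustration.

```python
import pandas as pd

# Hypothetical place names for illustration only (not from parsed.csv)
places = pd.Series(['Alaska', 'New Mexico', 'Mexico',
                    'Pacific-Antarctic Ridge', 'Nevada', None])

# 'Antarctic' matches as a substring; (?<!New ) rejects 'Mexico' preceded by 'New '
pattern = r'Antarctic|(?<!New )Mexico|Alaska'

# na=False treats missing place names as non-matches
mask = places.str.contains(pattern, na=False)

# Convert the Boolean mask into a 0/1 flag column
ring_of_fire_flag = mask.astype(int)
print(ring_of_fire_flag.tolist())
```

Compared with `isin`, this matches partial names, so the resulting counts can differ from those produced by an exact-match list.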

**5. Calculate the number of earthquakes in the Ring of Fire locations and the number outside of them.**

In [6]:
# Load the csv file as a dataframe
df = pd.read_csv(r'C:\Users\thefli0\Downloads\updated_parsed.csv')

# Calculate the number of earthquakes in Ring of Fire locations
num_ring_of_fire_earthquakes = df[df['Ring of Fire'] == 1].shape[0]

# Calculate the number of earthquakes outside of Ring of Fire locations
num_non_ring_of_fire_earthquakes = df[(df['type'] == 'earthquake') & (df['Ring of Fire'] == 0)].shape[0]

# Display results
print(f"Number of earthquakes in the Ring of Fire locations: {num_ring_of_fire_earthquakes}")
print(f"Number of earthquakes outside the Ring of Fire locations: {num_non_ring_of_fire_earthquakes}")

Number of earthquakes in the Ring of Fire locations: 7000
Number of earthquakes outside the Ring of Fire locations: 2081
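
Both counts can also be obtained in one step with `value_counts()`. The sketch below uses a made-up flag column rather than the saved file.

```python
import pandas as pd

# Hypothetical 0/1 flags for illustration only
flags = pd.Series([1, 0, 1, 1, 0], name='Ring of Fire')

# value_counts() tallies each distinct value in one pass
counts = flags.value_counts()

print("Inside:", counts.loc[1])
print("Outside:", counts.loc[0])
```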


**6. Find the tsunami count along the Ring of Fire.**

In [7]:
# Load the csv file as a dataframe
df = pd.read_csv(r'C:\Users\thefli0\Downloads\updated_parsed.csv')

# Filter the dataframe for rows where 'Ring of Fire' equals 1 and 'tsunami' equals 1
tsunami_in_ring_of_fire_df = df[(df['Ring of Fire'] == 1) & (df['tsunami'] == 1)]

# Count number of tsunamis in filtered dataframe
num_tsunamis_in_ring_of_fire = tsunami_in_ring_of_fire_df.shape[0]

# Display results
print(f"Total count of tsunamis in the Ring of Fire locations: {num_tsunamis_in_ring_of_fire}")

Total count of tsunamis in the Ring of Fire locations: 45


## Pandas DataFrames

**Using the file WHO_first9cols.csv, complete the following tasks:**
 - Load the data into a DataFrame and print the results
 - Query the number of rows
 - Print the column headers
 - Print the data types
 - Print the index

In [8]:
# Load the csv file as the dataframe
df = read_csv(r"C:\Users\thefli0\Downloads\WHO_first9cols.csv")

# Display the first five rows with headers
print("Dataframe Top 5 rows:\n", df.head())

Dataframe Top 5 rows:
        Country  CountryID  Continent  Adolescent fertility rate (%)  \
0  Afghanistan          1          1                          151.0   
1      Albania          2          2                           27.0   
2      Algeria          3          3                            6.0   
3      Andorra          4          2                            NaN   
4       Angola          5          3                          146.0   

   Adult literacy rate (%)  \
0                     28.0   
1                     98.7   
2                     69.9   
3                      NaN   
4                     67.4   

   Gross national income per capita (PPP international $)  \
0                                                NaN        
1                                             6000.0        
2                                             5940.0        
3                                                NaN        
4                                             3890.0        

  

In [9]:
# Displays the number of rows
print("Length:\n", len(df))
print("\n")

# Displays column headers
print("Column Headers:\n", df.columns)
print("\n")

# Displays the data types
print("Data types:\n", df.dtypes)
print("\n")

# Displays the index
print("Index:\n", df.index)
print("\n")

Length:
 202


Column Headers:
 Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total'],
      dtype='object')


Data types:
 Country                                                    object
CountryID                                                   int64
Continent                                                   int64
Adolescent fertility rate (%)                             float64
Adult literacy rate (%)                                   float64
Gross national income per capita (PPP international $)    float64
Net primary school enrolment ratio female (%)             float64
Net primary school enrolment ratio male (%)               float64
Population (in thousands) total                           float64
dtype: o

## Pandas Series

**Using the same file, select the "Country" column and return its data type along with the series shape, index, values, and name.**

In [10]:
# Filter the dataframe for "Country"
country_col = df["Country"]

# Display the data type
print("Type df:\n", type(df), "\n")
print("Type country col:\n", type(country_col), "\n")

Type df:
 <class 'pandas.core.frame.DataFrame'> 

Type country col:
 <class 'pandas.core.series.Series'> 



In [11]:
# Display the series shape
print("Series shape:\n", country_col.shape, "\n")

# Display the series index
print("Series index:\n", country_col.index, "\n")

# Display the series values
print("Series values:\n", country_col.values, "\n")

# Display the series name
print("Series name:\n", country_col.name, "\n")

Series shape:
 (202,) 

Series index:
 RangeIndex(start=0, stop=202, step=1) 

Series values:
 ['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei Darussalam'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Cook Islands' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji'
 'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia' 'Georgia'
 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau'
 'Guyana' 'Haiti' 'Honduras' 'Ho

## Querying data in Pandas

**Using the Quandl API, import the data**
- Print the head() and tail()
- Query the last value using the last date
- Query the date with date strings in the YYYYMMDD format
- Query with a Boolean, where the number of observations is greater than the mean number of observations
- Query with a Boolean, where the number of sunspots is greater than the mean number of sunspots

In [12]:
# Import the data using the Quandl API
sunspots = quandl.get("SIDC/SUNSPOTS_A")

# Display first five results
print("Head 5:\n", sunspots.head(5))

# Display bottom five results
print("Tail 5:\n", sunspots.tail(5))

Head 5:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1700-12-31                               8.3                             NaN   
1701-12-31                              18.3                             NaN   
1702-12-31                              26.7                             NaN   
1703-12-31                              38.3                             NaN   
1704-12-31                              60.0                             NaN   

            Number of Observations  Definitive/Provisional Indicator  
Date                                                                  
1700-12-31                     NaN                               1.0  
1701-12-31                     NaN                               1.0  
1702-12-31                     NaN                               1.0  
1703-12-31                     NaN                               1.0  
1704

In [13]:
# Query the last value using the last date in the dataset
last_date = sunspots.index[-1]
print("Last value:\n",sunspots.loc[last_date])

Last value:
 Yearly Mean Total Sunspot Number        8.8
Yearly Mean Standard Deviation          4.1
Number of Observations              14440.0
Definitive/Provisional Indicator        1.0
Name: 2020-12-31 00:00:00, dtype: float64


In [14]:
# Display the date with the YYYYMMDD format
print("Values slice by date:\n", sunspots["20020101": "20131231"])

Values slice by date:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
2002-12-31                             163.6                             9.8   
2003-12-31                              99.3                             7.1   
2004-12-31                              65.3                             5.9   
2005-12-31                              45.8                             4.7   
2006-12-31                              24.7                             3.5   
2007-12-31                              12.6                             2.7   
2008-12-31                               4.2                             2.5   
2009-12-31                               4.8                             2.5   
2010-12-31                              24.9                             3.4   
2011-12-31                              80.8                             6.7   
2012-12-31       
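
Date-string slicing works on any sorted `DatetimeIndex`, so it can be tried without the Quandl API. The sketch below builds a small made-up yearly frame (the values are not real sunspot data) and slices it with `.loc`, which makes the row-label intent explicit.

```python
import pandas as pd

# Hypothetical yearly values for illustration only (not real sunspot data)
index = pd.to_datetime(['2000-12-31', '2001-12-31', '2002-12-31',
                        '2003-12-31', '2004-12-31'])
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]}, index=index)

# YYYYMMDD strings are parsed as dates; both endpoints are inclusive
window = toy.loc['20020101':'20031231']

print(window)
```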

In [15]:
# Apply a Boolean mask to the whole DataFrame: values not greater than their column mean become NaN
print("Boolean selection:\n", sunspots[sunspots > sunspots.mean()])

Boolean selection:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1700-12-31                               NaN                             NaN   
1701-12-31                               NaN                             NaN   
1702-12-31                               NaN                             NaN   
1703-12-31                               NaN                             NaN   
1704-12-31                               NaN                             NaN   
...                                      ...                             ...   
2016-12-31                               NaN                             NaN   
2017-12-31                               NaN                             NaN   
2018-12-31                               NaN                             NaN   
2019-12-31                               NaN                             NaN   
2020-12-31          

In [16]:
# Use a Boolean on the 'Number of Observations' column to keep rows above that column's mean
print("Boolean selection with column label:\n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()])

Boolean selection with column label:
             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
Date                                                                           
1981-12-31                             198.9                            13.1   
1982-12-31                             162.4                            12.1   
1983-12-31                              91.0                             7.6   
1984-12-31                              60.5                             5.9   
1985-12-31                              20.6                             3.7   
1986-12-31                              14.8                             3.5   
1987-12-31                              33.9                             3.7   
1988-12-31                             123.0                             8.4   
1989-12-31                             211.1                            12.8   
1990-12-31                             191.8                            11.2   
19
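
The two Boolean queries behave differently: comparing a single column yields a row filter, while comparing the whole DataFrame yields a cell-by-cell mask that keeps every row and fills non-matching cells with NaN. A minimal sketch on made-up numbers (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({'sunspots': [10.0, 50.0, 90.0],
                    'observations': [100.0, 300.0, 200.0]})

# Row filter: keep rows whose 'sunspots' value exceeds the column mean
rows = toy[toy['sunspots'] > toy['sunspots'].mean()]

# Cell mask: same shape as the input, NaN wherever a cell is not above its column mean
masked = toy[toy > toy.mean()]

print(rows)
print(masked)
```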

## Statistics with Pandas DataFrame

**Using the Quandl API, import the data and run the following descriptive stats where Sunspots is not equal to NaN.**
 - Print the results of the describe function
 - Print the count of observations
 - Print the mad
 - Print the mean
 - Print the median
 - Print the Max
 - Print the Min
 - Print the Mode
 - Print the standard deviation
 - Print the variance
 - Print the Skewness

In [17]:
# Import the data using the Quandl API
sunspots = quandl.get("SIDC/SUNSPOTS_A")

# Display results for describe function
print("Describe", sunspots.describe(),"\n")

# Display count of observations
print("Non NaN observations", sunspots.count(),"\n")

# DataFrame.mad() is unavailable here (it was removed in pandas 2.0)
# Define a function to calculate the Mean Absolute Deviation (MAD) manually
def calculate_mad(series):
    mean_value = series.mean()
    mad_value = np.mean(np.abs(series - mean_value))
    return mad_value

# Define applicable data types
mad_values = sunspots.select_dtypes(include=['float64']).apply(calculate_mad)
# Display the MAD
print("MAD", (mad_values), "\n")

# Display the mean
print("Mean", sunspots.mean(),"\n")

# Display the median
print("Median", sunspots.median(),"\n")

# Display the Max
print("Max", sunspots.max(),"\n")

# Display the Min
print("Min", sunspots.min(),"\n")

# Display the mode
print("Mode", sunspots.mode(),"\n")

# Display the standard deviation
print("Standard Deviation", sunspots.std(),"\n")

# Display the variance
print("Variance", sunspots.var(),"\n")

# Display the skewness
print("Skewness", sunspots.skew(),"\n")

Describe        Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  \
count                        321.000000                      203.000000   
mean                          78.517134                        7.892118   
std                           62.091523                        3.866310   
min                            0.000000                        0.500000   
25%                           24.200000                        4.550000   
50%                           65.300000                        7.600000   
75%                          115.200000                       10.350000   
max                          269.300000                       19.100000   

       Number of Observations  Definitive/Provisional Indicator  
count              203.000000                             321.0  
mean              1691.857143                               1.0  
std               2913.060813                               0.0  
min                150.000000                      
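
To check the manual MAD calculation by hand, it helps to run it on a tiny series. The sketch below uses made-up numbers and relies on pandas skipping NaN values by default in `mean()`.

```python
import pandas as pd

# Hypothetical values for illustration only; the NaN is skipped by mean()
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, float('nan')])

# Mean absolute deviation: average distance of each value from the mean
mad = (s - s.mean()).abs().mean()

print(mad)
```

The mean is 3, the absolute deviations are 2, 1, 0, 1, 2, and their average is 1.2.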

## Data Aggregation

**Using the NumPy random data generator, create a data frame with the following columns (Weather, Food Price, and Number)(pg. 70 of your text).**
 - Group the data by the weather column and then create a function to iterate through the groups (you should have two groups after, for hot and cold)
 - Your function/variable that you created (weather_group) can be used for aggregation methods - print the first row, the last row, and the mean for each group
 - Create another group, on Food (so you would have Weather and Food)
 - Using the new groups - use the agg() function with NumPy's mean and median to find the mean and median number and prices

In [18]:
# Initialize the random generator
seed(42)

# Load the dataframe
df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
   'cold', 'hot', 'cold'],
   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
   'icecream', 'icecream', 'soup'],
   'Price' : 10 * rand(7), 'Number' : randint(1, 9)})

# Display results
print(df)

  Weather       Food     Price  Number
0    cold       soup  3.745401       8
1     hot       soup  9.507143       8
2    cold   icecream  7.319939       8
3     hot  chocolate  5.986585       8
4    cold   icecream  1.560186       8
5     hot   icecream  1.559945       8
6    cold       soup  0.580836       8
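
Every row shows the same `Number` because `randint(1, 9)` without a `size` argument returns a single integer, which pandas repeats down the column. If one value per row is wanted instead (an assumption; the exercise does not say so), passing `size` is one way to get it:

```python
from numpy.random import seed, randint

# Seed for reproducibility, as in the cell above
seed(42)

# Without size: a single integer in [1, 9), broadcast to every row by pandas
single = randint(1, 9)

# With size=7: an array of seven independent draws, one per row
per_row = randint(1, 9, size=7)

print(single, per_row)
```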


In [19]:
# Define the function to iterate through groups
def weather_group(name, group):
    print(f"Weather Group: {name}")
    print("First row:")
    print(group.iloc[0])
    print("\nLast row:")
    print(group.iloc[-1])
    print("\nMean values:")
    print(group.mean(numeric_only=True))
    print("\n")

# Group by Weather
weather_groups = df.groupby('Weather')

# Apply function to each group
for name, group in weather_groups:
    weather_group(name, group)

Weather Group: cold
First row:
Weather        cold
Food           soup
Price      3.745401
Number            8
Name: 0, dtype: object

Last row:
Weather        cold
Food           soup
Price      0.580836
Number            8
Name: 6, dtype: object

Mean values:
Price     3.301591
Number    8.000000
dtype: float64


Weather Group: hot
First row:
Weather         hot
Food           soup
Price      9.507143
Number            8
Name: 1, dtype: object

Last row:
Weather         hot
Food       icecream
Price      1.559945
Number            8
Name: 5, dtype: object

Mean values:
Price     5.684558
Number    8.000000
dtype: float64




In [20]:
# Define the new group
wf_group = df.groupby(['Weather', 'Food'])

# Display the results
print("WF Groups", wf_group.groups)

WF Groups {('cold', 'icecream'): [2, 4], ('cold', 'soup'): [0, 6], ('hot', 'chocolate'): [3], ('hot', 'icecream'): [5], ('hot', 'soup'): [1]}


In [21]:
# Display the results for the mean and median with new groups and the agg() function
print("WF Aggregated\n", wf_group.agg([np.mean, np.median]))

WF Aggregated
                       Price           Number       
                       mean    median   mean median
Weather Food                                       
cold    icecream   4.440063  4.440063    8.0    8.0
        soup       2.163119  2.163119    8.0    8.0
hot     chocolate  5.986585  5.986585    8.0    8.0
        icecream   1.559945  1.559945    8.0    8.0
        soup       9.507143  9.507143    8.0    8.0
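
Recent pandas releases may warn when NumPy callables such as `np.mean` are passed to `agg()`. Passing the aggregation names as strings produces the same kind of table. A minimal sketch on made-up prices:

```python
import pandas as pd

# Hypothetical prices for illustration only
toy = pd.DataFrame({'Weather': ['cold', 'hot', 'cold', 'hot'],
                    'Price': [1.0, 2.0, 3.0, 6.0]})

# String names select the built-in pandas aggregations
summary = toy.groupby('Weather')['Price'].agg(['mean', 'median'])

print(summary)
```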




## Concatenating and appending DataFrames

**Using the dataframe you created previously, select the first 3 rows.**
 - Using the concat function from pandas, put the 3 rows that you selected back with the original dataframe
 - Using the append function, take those 3 rows and the last 2 rows of the original DataFrame and bring them together

In [22]:
# Display the first 3 rows
print("df :3\n", df[:3])

df :3
   Weather      Food     Price  Number
0    cold      soup  3.745401       8
1     hot      soup  9.507143       8
2    cold  icecream  7.319939       8


In [23]:
# Use the concat function to put the first 3 rows back
print("Concatenate Back\n", pd.concat([df[:3], df[3:]]))

Concatenate Back
   Weather       Food     Price  Number
0    cold       soup  3.745401       8
1     hot       soup  9.507143       8
2    cold   icecream  7.319939       8
3     hot  chocolate  5.986585       8
4    cold   icecream  1.560186       8
5     hot   icecream  1.559945       8
6    cold       soup  0.580836       8


In [24]:
# DataFrame.append was removed in pandas 2.0, so use concat to bring together the first 3 rows with the last 2 rows
append_rows = pd.concat([df.head(3), df.tail(2)], ignore_index=True)
print("Appending rows\n", append_rows)

Appending rows
   Weather      Food     Price  Number
0    cold      soup  3.745401       8
1     hot      soup  9.507143       8
2    cold  icecream  7.319939       8
3     hot  icecream  1.559945       8
4    cold      soup  0.580836       8


## Joining DataFrames

**Using the two csv files dtest.csv and tips.csv, we will bring together two datasets, also known as a join.**
 - Using the merge() function, bring dtest and tips together on the "EmpNr" column and print the results
 - Using the join() function, query both files and print the results

In [25]:
# Load the two datasets
dests = pd.read_csv(r'C:\Users\thefli0\Downloads\dest.csv')
print("Dests\n", dests)

tips = pd.read_csv(r'C:\Users\thefli0\Downloads\tips.csv')
print("Tips\n", tips)

# Use the merge function to bring files together on the 'EmpNr' column
print("Merge() on key\n", pd.merge(dests, tips, on='EmpNr'))
# Use the join() function, which aligns the two files on their row index
print("Dests join() tips\n", dests.join(tips, lsuffix='Dest', rsuffix='Tips'))

# Display the results of the joins
print("Inner join with merge()\n", pd.merge(dests, tips, how='inner'))
print("Outer join\n", pd.merge(dests, tips, how='outer'))

Dests
    EmpNr       Dest
0      5  The Hague
1      3  Amsterdam
2      9  Rotterdam
Tips
    EmpNr  Amount
0      5    10.0
1      9     5.0
2      7     2.5
Merge() on key
    EmpNr       Dest  Amount
0      5  The Hague    10.0
1      9  Rotterdam     5.0
Dests join() tips
    EmpNrDest       Dest  EmpNrTips  Amount
0          5  The Hague          5    10.0
1          3  Amsterdam          9     5.0
2          9  Rotterdam          7     2.5
Inner join with merge()
    EmpNr       Dest  Amount
0      5  The Hague    10.0
1      9  Rotterdam     5.0
Outer join
    EmpNr       Dest  Amount
0      5  The Hague    10.0
1      3  Amsterdam     NaN
2      9  Rotterdam     5.0
3      7        NaN     2.5
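
To see which side each row of an outer join came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the two small tables from the printed output above rather than reading the csv files.

```python
import pandas as pd

# Tables rebuilt from the Dests and Tips output shown above
dests = pd.DataFrame({'EmpNr': [5, 3, 9],
                      'Dest': ['The Hague', 'Amsterdam', 'Rotterdam']})
tips = pd.DataFrame({'EmpNr': [5, 9, 7], 'Amount': [10.0, 5.0, 2.5]})

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
merged = pd.merge(dests, tips, on='EmpNr', how='outer', indicator=True)

# Map each employee number to its origin label
origin = dict(zip(merged['EmpNr'], merged['_merge'].astype(str)))
print(origin)
```

Row order in an outer merge can vary between pandas versions, so the lookup by `EmpNr` is more reliable than relying on position.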


## Handling Missing Values

**Using the WHO_first9cols.csv file, select the first 3 rows, including the headers for these two columns Country & Net primary school enrollment ratio male (%)**
 - Check for missing values
 - Count the number of NaN values
 - Print any non-missing values
 - Replace the missing values with a scalar value

In [26]:
# Load the csv file as the dataframe
df = read_csv(r"C:\Users\thefli0\Downloads\WHO_first9cols.csv")

# Select the header plus the first 2 data rows of Country and Net primary school enrolment ratio male (%)
df = df[['Country', df.columns[-2]]][:2]

# Flag which values are missing
print("Null Values\n", pd.isnull(df))

# Display count of NaN values
print("Total Null Values\n", pd.isnull(df).sum())

# Flag which values are not missing
print("Not Null Values\n", df.notnull())

# Replace any missing values
print("Zero filled\n", df.fillna(0))

Null Values
    Country  Net primary school enrolment ratio male (%)
0    False                                         True
1    False                                        False
Total Null Values
 Country                                        0
Net primary school enrolment ratio male (%)    1
dtype: int64
Not Null Values
    Country  Net primary school enrolment ratio male (%)
0     True                                        False
1     True                                         True
Zero filled
        Country  Net primary school enrolment ratio male (%)
0  Afghanistan                                          0.0
1      Albania                                         94.0
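
Filling with 0 is only one choice of scalar. Two common alternatives are filling with the column mean or dropping the incomplete rows. The sketch below uses placeholder countries and made-up ratios, not values from WHO_first9cols.csv.

```python
import pandas as pd

col = 'Net primary school enrolment ratio male (%)'

# Hypothetical rows for illustration only (placeholder names and values)
toy = pd.DataFrame({'Country': ['A', 'B', 'C'],
                    col: [float('nan'), 94.0, 96.0]})

# Option 1: replace missing values with the column mean (NaN is skipped by mean())
filled_mean = toy[col].fillna(toy[col].mean())

# Option 2: drop rows where the column is missing
dropped = toy.dropna(subset=[col])

print(filled_mean.tolist())
print(dropped)
```

Which option is appropriate depends on the analysis; a zero enrolment ratio, for instance, reads as a real measurement rather than a gap.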
