# Assignment Chapter 6 - Data Wrangling with Python

**Instructions**
1. This assignment is split into 2 parts. Part 1 focuses on data inspection. Part 2 involves wrangling and exporting a dataset.
2. Please answer the Part 1 questions in the boxes provided.
3. For Part 2, export your clean and wrangled data as an “assignment.csv” file.
4. Please submit the assignment through the TalentLabs Learning System. You will need to submit a zip file which contains this Word document (with answers) and the wrangled data file assignment.csv.

## Part 1 – Data Inspection with Python

In [1]:
import pandas

df = pandas.read_csv("./GlobalTemperatures.csv")
df.head(10)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
9,1744-08-01,,,Århus,Denmark,57.05N,10.33E


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   dt                             object 
 1   AverageTemperature             float64
 2   AverageTemperatureUncertainty  float64
 3   City                           object 
 4   Country                        object 
 5   Latitude                       object 
 6   Longitude                      object 
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


In [3]:
# Question 1.1:
# How many rows and columns make up this dataset?

rows = df.shape[0]
columns = df.shape[1]

print(f"No. of Rows: {rows}")
print(f"No. of Columns: {columns}")

No. of Rows: 8599212
No. of Columns: 7


In [4]:
# Question 1.2:
# How many duplicated rows are there?

no_of_duplicated_rows = df.duplicated().sum()

print(f"No. of duplicated rows: {no_of_duplicated_rows}")

No. of duplicated rows: 0


In [5]:
# Question 1.3:
# How many columns have missing values?

print("No. of missing values:\n")

df.isna().sum()

No. of missing values:



dt                                    0
AverageTemperature               364130
AverageTemperatureUncertainty    364130
City                                  0
Country                               0
Latitude                              0
Longitude                             0
dtype: int64
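
The per-column counts above show which columns contain gaps, but the question asks for the number of columns. A minimal sketch of one way to get that number directly, run on a small toy frame (`toy`, built here from the first three rows shown earlier, is an illustrative stand-in for `df`):

```python
import pandas as pd

# Toy stand-in for the real dataset: first three rows from df.head(10).
toy = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1744-01-01"],
    "AverageTemperature": [6.068, None, None],
    "AverageTemperatureUncertainty": [1.737, None, None],
    "City": ["Århus", "Århus", "Århus"],
})

# isna().any() gives one boolean per column: True if the column has any NaN.
has_missing = toy.isna().any()

# Summing the booleans counts the columns that contain missing values.
n_columns_with_missing = int(has_missing.sum())
columns_with_missing = toy.columns[has_missing].tolist()

print(f"No. of columns with missing values: {n_columns_with_missing}")
print(columns_with_missing)
```

Applied to the full dataset, the same two lines on `df` would be expected to report the two temperature columns, in line with the counts shown above.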

In [6]:
# Question 1.4:
# Is there a pattern to the missing data?

# Select the rows where exactly one of "AverageTemperature" and
# "AverageTemperatureUncertainty" is missing.
# An empty result means the two columns are always missing together.
df[df.AverageTemperature.isna() != df.AverageTemperatureUncertainty.isna()]

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
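
The empty result suggests the two temperature columns are always missing in the same rows. As a rough sketch, the same check can be written as an explicit yes/no answer plus a small cross-tabulation of the two masks. The frame `toy` below is an illustrative stand-in for `df`, using values from the rows shown earlier:

```python
import pandas as pd

# Toy stand-in for the real dataset (two complete rows, two rows with gaps).
toy = pd.DataFrame({
    "AverageTemperature": [6.068, None, None, 5.788],
    "AverageTemperatureUncertainty": [1.737, None, None, 3.624],
})

temp_missing = toy["AverageTemperature"].isna()
unc_missing = toy["AverageTemperatureUncertainty"].isna()

# True only if both columns are missing in exactly the same rows.
missing_together = bool((temp_missing == unc_missing).all())

# Cross-tabulate the two masks: off-diagonal cells count rows where
# only one of the two values is missing.
pattern = pd.crosstab(temp_missing, unc_missing)

print(missing_together)
print(pattern)
```

If the off-diagonal cells of `pattern` are zero, a temperature is never recorded without its uncertainty, or the other way round.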


In [7]:
# Question 1.5:
# How many unique countries and cities are there in the dataset?

no_of_unique_countries = df.Country.nunique()
no_of_unique_cities = df.City.nunique()

print(f"No. of unique countries: {no_of_unique_countries}")
print(f"No. of unique cities: {no_of_unique_cities}")

No. of unique countries: 159
No. of unique cities: 3448


In [23]:
# Question 1.6:
# What is the date range of the data? (Use dd.mm.yyyy date format for your answer)

df["dt"] = pandas.to_datetime(df["dt"])
start_date = df.dt.min().strftime("%d.%m.%Y")
end_date = df.dt.max().strftime("%d.%m.%Y")

print(start_date, "-", end_date)

01.11.1743 - 01.09.2013


## Part 2 – Data Wrangling with Python

In this final part of the assignment, your task is to prepare the GlobalTemperatures.csv dataset for analysis. Carry out the actions below to wrangle this dataset. Once finished, create a zip folder submission.zip which contains your wrangled dataset and this Word document with the answers to the questions in Parts 1 and 2 of this assignment. Make sure the wrangled dataset is named assignment.csv. Good luck!

Data wrangling tasks:

- Rename the dt column to Date.
- Convert the format of the Date column to datetime.
- Drop any rows which contain missing values.
- Combine the Latitude and Longitude columns into one “Location” column. Make sure the values are separated by a comma and a space, e.g.: 57.05N, 10.33E
- Drop the Latitude and Longitude columns.
- Filter the data to only contain rows where the country is Australia or Brazil.
- Sort the data by date with the most recent dates first.
- Export the data as a csv file called “assignment.csv”. Do not include the index column: df.to_csv("assignment.csv", index=False).
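
The tasks above can be sketched end to end as a single method chain. This is an illustrative outline on a tiny hand-built frame, not the required solution: `raw` and `wrangled` are hypothetical names, and the rows are copied from outputs shown in this document.

```python
import pandas as pd

# Tiny stand-in for GlobalTemperatures.csv.
raw = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1841-01-01", "2013-08-01", "1824-01-01"],
    "AverageTemperature": [6.068, None, 21.432, 23.175, 25.883],
    "AverageTemperatureUncertainty": [1.737, None, 3.286, 0.663, 1.263],
    "City": ["Århus", "Århus", "Adelaide", "Queimados", "Macapá"],
    "Country": ["Denmark", "Denmark", "Australia", "Brazil", "Brazil"],
    "Latitude": ["57.05N", "57.05N", "34.56S", "23.31S", "0.80N"],
    "Longitude": ["10.33E", "10.33E", "138.16E", "42.82W", "50.63W"],
})

wrangled = (
    raw.rename(columns={"dt": "Date"})                      # rename dt -> Date
       .assign(Date=lambda d: pd.to_datetime(d["Date"]))    # convert to datetime
       .dropna()                                            # drop rows with missing values
       .assign(Location=lambda d: d["Latitude"] + ", " + d["Longitude"])  # comma + space
       .drop(columns=["Latitude", "Longitude"])
       .loc[lambda d: d["Country"].isin(["Australia", "Brazil"])]
       .sort_values(by="Date", ascending=False)             # most recent first
)

print(wrangled)
```

The cells below carry out the same steps one at a time, which makes each intermediate result easier to inspect.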

In [9]:
import pandas

df2 = pandas.read_csv("./GlobalTemperatures.csv")
df2.head(5)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [10]:
# Rename the dt column to Date, then convert the Date column to datetime.
df3 = df2.copy(deep=True)
df3.rename(columns={"dt":"Date"}, inplace=True)
df3["Date"] = pandas.to_datetime(df3["Date"])
df3.head(5)

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [25]:
# check which columns have missing values
df3.isna().sum()

Date                                  0
AverageTemperature               364130
AverageTemperatureUncertainty    364130
City                                  0
Country                               0
Latitude                              0
Longitude                             0
dtype: int64

In [26]:
# Drop any rows which contain missing values.
df3.dropna(axis=0, inplace=True)
df3.head(5)

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E


In [28]:
# See if the rows with missing values are dropped.
# (This cell was run after the Location column was added, so Location is listed too.)
df3.isna().sum()

Date                             0
AverageTemperature               0
AverageTemperatureUncertainty    0
City                             0
Country                          0
Latitude                         0
Longitude                        0
Location                         0
dtype: int64

In [27]:
# Combine the Latitude and Longitude columns into one “Location” column. 
# Make sure the values are separated by a comma and a space, e.g.: 57.05N, 10.33E
df3["Location"] = df3["Latitude"] + ", " + df3["Longitude"]
df3.head(3)

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude,Location
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E,"57.05N, 10.33E"
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E,"57.05N, 10.33E"
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E,"57.05N, 10.33E"


In [30]:
# Drop the Latitude and Longitude columns.
df4 = df3.copy(deep=True)
df4.drop(columns=["Latitude","Longitude"], inplace=True)
df4

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,Country,Location
0,1743-11-01,6.068,1.737,Århus,Denmark,"57.05N, 10.33E"
5,1744-04-01,5.788,3.624,Århus,Denmark,"57.05N, 10.33E"
6,1744-05-01,10.644,1.283,Århus,Denmark,"57.05N, 10.33E"
7,1744-06-01,14.051,1.347,Århus,Denmark,"57.05N, 10.33E"
8,1744-07-01,16.082,1.396,Århus,Denmark,"57.05N, 10.33E"
...,...,...,...,...,...,...
8599206,2013-04-01,7.710,0.182,Zwolle,Netherlands,"52.24N, 5.26E"
8599207,2013-05-01,11.464,0.236,Zwolle,Netherlands,"52.24N, 5.26E"
8599208,2013-06-01,15.043,0.261,Zwolle,Netherlands,"52.24N, 5.26E"
8599209,2013-07-01,18.775,0.193,Zwolle,Netherlands,"52.24N, 5.26E"


In [34]:
# Filter the data to only contain rows with Australia and Brazil countries.
df5 = df4[df4["Country"].isin(["Australia","Brazil"])]
df5.head(5)

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,Country,Location
78638,1841-01-01,21.432,3.286,Adelaide,Australia,"34.56S, 138.16E"
78639,1841-02-01,22.087,2.458,Adelaide,Australia,"34.56S, 138.16E"
78640,1841-03-01,18.859,3.547,Adelaide,Australia,"34.56S, 138.16E"
78641,1841-04-01,15.033,1.884,Adelaide,Australia,"34.56S, 138.16E"
78642,1841-05-01,12.864,1.481,Adelaide,Australia,"34.56S, 138.16E"


In [36]:
# Sort the data by date with the most recent dates first. 
df6 = df5.sort_values(by="Date", ascending=False)
df6

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,Country,Location
8243891,2013-08-01,16.126,0.325,Wollongong,Australia,"34.56S, 151.78E"
6168448,2013-08-01,23.175,0.663,Queimados,Brazil,"23.31S, 42.82W"
2640105,2013-08-01,21.685,0.659,Governador Valadares,Brazil,"18.48S, 42.25W"
7281458,2013-08-01,18.215,0.742,Suzano,Brazil,"23.31S, 46.31W"
6474477,2013-08-01,17.696,0.527,São José,Brazil,"28.13S, 48.18W"
...,...,...,...,...,...,...
6746356,1824-02-01,25.088,2.012,Santarém,Brazil,"2.41S, 55.45W"
1010754,1824-01-01,27.162,1.183,Boa Vista,Brazil,"2.41N, 60.27W"
6746355,1824-01-01,25.962,1.348,Santarém,Brazil,"2.41S, 55.45W"
4474598,1824-01-01,25.883,1.263,Macapá,Brazil,"0.80N, 50.63W"


In [None]:
# Export the data as a csv file called "assignment.csv".
# Do not include the index column: df.to_csv("assignment.csv", index=False).
# Export df6 (the sorted frame); df5 is filtered but not yet sorted.
df6.to_csv("assignment.csv", index=False)
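
One way to sanity-check an export like this is to read the file back and confirm that the index column is absent, the dates are still in descending order, and the Location values kept their comma and space. The sketch below writes a small illustrative frame (`sample`, a hypothetical stand-in for the wrangled data) to a temporary directory, so it does not touch the real assignment.csv:

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the wrangled, sorted data (rows taken from the outputs above).
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2013-08-01", "1841-01-01"]),
    "City": ["Queimados", "Adelaide"],
    "Country": ["Brazil", "Australia"],
    "Location": ["23.31S, 42.82W", "34.56S, 138.16E"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "assignment.csv")
    sample.to_csv(path, index=False)                    # no index column written
    check = pd.read_csv(path, parse_dates=["Date"])     # read the file back

# No leftover index column such as "Unnamed: 0".
print(list(check.columns))
# Most recent dates first.
print(check["Date"].is_monotonic_decreasing)
# Values containing a comma are quoted in the CSV and read back unchanged.
print(check["Location"].tolist())
```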