### Instructions:

- You can attempt any number of questions and in any order.  
  See the assignment page for a description of the hurdle requirement for this assessment.
- You may submit your practical for autograding as many times as you like to check on progress; however, you will save time by checking and testing your own code before submitting.
- Develop and check your answers in the spaces provided.
- **Replace** the code `raise NotImplementedError()` with your solution to the question.
- Do **NOT** remove any variables or other markings already provided in the answer spaces.
- Do **NOT** make any changes to this notebook outside of the spaces indicated.  
  (If you do this, the submission system might not accept your work)

### Submitting:

1. Before you turn this problem in, make sure everything runs as expected by resetting this notebook.    
   (You can do this from the menubar above by selecting `Kernel`&#8594;`Restart Kernel and Run All Cells...`)
1. Don't forget to save your notebook after this step.
1. Submit your .ipynb file to Gradescope via file upload or GitHub repository.
1. You can submit as many times as needed.
1. Your submitted file **must** have the **identical** filename to the one you downloaded, without changing **any** aspect of it (spaces, underscores, capitalisation etc.). If your operating system has changed the filename because you downloaded the file more than once, you **must** also fix this.



---

# <mark style="background: #801010; color: #ffffff;" >B2</mark> Topic 5: Working with pandas 🐼🐼🐼   

In [2]:
# Useful imports...
import numpy as np
import pandas as pd
import string
import math

#### Question 01 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2 Points)

Create a pandas Series named `series_1` with values:
```python
'a','b','c','d','e'
```
and index from 1 to 5 inclusive.

In [4]:
series_1 = pd.Series(['a','b','c','d','e'], index=range(1,6))
series_1

1    a
2    b
3    c
4    d
5    e
dtype: object

In [5]:
# Testing Cell (Do NOT modify this cell)

#### Question 02 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2 Points)

Create a pandas Series named `series_2` with a default index and the values:
```python
'data', 'python', 'science', 'machine', 'learning'
```

In [6]:
# Write your solution here

# YOUR CODE HERE
series_2 = pd.Series(['data','python','science','machine','learning'])
series_2

0        data
1      python
2     science
3     machine
4    learning
dtype: object

In [7]:
# Testing Cell (Do NOT modify this cell)

#### Question 03 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2 Points)

Create a pandas Series named `series_3` from the `np.array` variable supplied below with an index of the uppercase alphabet 'A' to 'Z'.
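
As a rough sketch of building a letter index, the standard library's `string.ascii_uppercase` constant can supply the labels. The names `values`, `letters` and `lettered` are hypothetical and the data is a stand-in, not the array supplied below.

```python
import string

import numpy as np
import pandas as pd

# Hypothetical stand-in values: the integers 0..25
values = np.arange(26)

# string.ascii_uppercase is the constant 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters = list(string.ascii_uppercase)

# One label per value; the lengths must match
lettered = pd.Series(values, index=letters)
print(lettered.loc[['A', 'Z']].tolist())  # [0, 25]
```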

In [3]:
array = np.array([ 7, 59, 42, 13, 67, 19, 26, 59, 99, 97, 77,  1, 36, 49, 10, 51, 41,
       73, 33, 79, 19, 34, 84, 11, 41, 75])

In [4]:
index = list(string.ascii_uppercase)  # uppercase alphabet 'A' to 'Z'
series_3 = pd.Series(array, index=index)
series_3

A     7
B    59
C    42
D    13
E    67
F    19
G    26
H    59
I    99
J    97
K    77
L     1
M    36
N    49
O    10
P    51
Q    41
R    73
S    33
T    79
U    19
V    34
W    84
X    11
Y    41
Z    75
dtype: int32

In [10]:
# Testing Cell (Do NOT modify this cell)

#### Question 04 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4 Points)

Copy the Series above (`series_3`) to a new object named `series_4` and assign an index of the lowercase alphabet 'a' to 'z'. The first value of `series_4` should be assigned the value `100` while `series_3` must remain unchanged.
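
To see why the copy matters, here is a minimal sketch on a hypothetical three-element Series (`original` and `duplicate` are illustrative names, not part of the assignment). `Series.copy()` makes a deep copy by default, so writing to the copy should leave the source untouched.

```python
import pandas as pd

# Hypothetical toy data, unrelated to series_3
original = pd.Series([7, 59, 42], index=['A', 'B', 'C'])

# copy() is deep by default: the new Series owns its own data and index
duplicate = original.copy()
duplicate.index = ['a', 'b', 'c']
duplicate.iloc[0] = 100  # positional assignment to the first element

print(original.tolist())   # [7, 59, 42]
print(duplicate.tolist())  # [100, 59, 42]
```

Plain assignment (`duplicate = original`) would only bind a second name to the same object, so an edit made through either name would be visible through both.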

In [11]:
series_4 = series_3.copy()  # Copying series_3 to series_4
series_4.index = list(string.ascii_lowercase)  # Lowercase alphabet from 'a' to 'z'
series_4.iloc[0] = 100  # Assigning the first value of series_4 to 100
print("\nSeries 4:")
print(series_4)


Series 4:
a    100
b     59
c     42
d     13
e     67
f     19
g     26
h     59
i     99
j     97
k     77
l      1
m     36
n     49
o     10
p     51
q     41
r     73
s     33
t     79
u     19
v     34
w     84
x     11
y     41
z     75
dtype: int32


In [12]:
# Testing Cell (Do NOT modify this cell)

#### Question 05 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4 Points)

Given a pandas Series of the integers from 1 to 4 named `series_5`, transform this series into a series of floats and set the name of the series to be `'Float data'`.


---
<details>
  <summary><span style="color:blue">Finding documentation on attributes and methods</span></summary>
    It's <strong>essential</strong> to be able to use the pandas API reference to discover details of the pandas classes (Series, DataFrame, Index etc). You can find how to set the data type and Series name here:
    <a href="https://pandas.pydata.org/docs/reference/index.html">API reference</a>
</details>
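
A small sketch of the two operations the API reference describes, using a hypothetical Series called `counts` (not the assignment's `series_5`). `astype` and `rename` each return a new Series, while assigning to `.name` changes the existing object.

```python
import pandas as pd

# Hypothetical integer Series
counts = pd.Series([10, 20, 30])

# astype returns a new Series with the requested dtype
as_float = counts.astype(float)

# The name can be set via rename(), which returns a new Series ...
named = as_float.rename('Measurements')

# ... or by assigning to the .name attribute of an existing Series
as_float.name = 'Measurements'

print(named.dtype, named.name)  # float64 Measurements
```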

In [13]:
series_5 = pd.Series([1, 2, 3, 4])         # integers 1 to 4, as described in the question
series_5 = series_5.astype(float)          # convert the elements to floats
series_5 = series_5.rename('Float data')   # set the name of the series

In [14]:
# Testing Cell (Do NOT modify this cell)

#### Question 06 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4 Points)

Create a Series with name `series_6` that contains all values greater than 50 in the NumPy `array` from Question 3. You should rely on NumPy filtering to sift these values.


---
<details>
  <summary><span style="color:blue">Filtering NumPy arrays</span></summary>
One powerful NumPy concept is filtering the contents of an array with a boolean mask. Suppose we want to find which elements of our `array` example are even (that is, where `array % 2 == 0`). We can form a boolean mask like this:
<pre>

print ([array % 2 == 0])
[array([False, False,  True, False, False, False,  True, False, False,
        False, False, False,  True, False,  True, False, False, False,
        False, False, False,  True,  True, False, False, False])]     
</pre>        
or obtain the values as
<pre>
print (array[array % 2 == 0])
[42 26 36 10 34 84]

</pre>
This is explained in the course readings <a href="https://ebookcentral.proquest.com/lib/adelaide/reader.action?docID=5446042&ppg=66">on page 58 of Hands-On Data Analysis with NumPy and Pandas : Implement Python Packages from Data Manipulation to Processing</a>.
</details>

In [15]:
series_6 = pd.Series(array[array > 50])
series_6

0     59
1     67
2     59
3     99
4     97
5     77
6     51
7     73
8     79
9     84
10    75
dtype: int32

In [16]:
# Testing Cell (Do NOT modify this cell)

#### Question 07 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4 Points)

Create a Series of floats that represents the values of the `array` in Question 3 where:
- the value is evenly divisible by 3 (`value % 3 == 0`), and
- the stored value is one sixth of the original value.

Assign this Series to the variable `series_7`. Again, you should be looking to harness the features of NumPy and pandas to create your Series.

For example, given an array with values:
```python
7, 6, 9
```
then `series_7` would be a pandas Series of:
```python
1.0, 1.5
```
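
The three-value example above can be traced step by step. This is only a sketch; `small`, `divisible_by_3` and `sixths` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# The three-value example from the question text
small = np.array([7, 6, 9])

# Boolean mask: True where the value is divisible by 3
divisible_by_3 = small % 3 == 0

# Keep the masked values, then divide; true division yields floats
sixths = pd.Series(small[divisible_by_3] / 6)
print(sixths.tolist())  # [1.0, 1.5]
```

The value 7 is dropped by the mask, leaving 6 and 9, which become 1.0 and 1.5.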

In [17]:
filtered = array[array%3==0]/6
#filtered
series_7 = pd.Series(filtered)
series_7

0     7.0
1    16.5
2     6.0
3     8.5
4     5.5
5    14.0
6    12.5
dtype: float64

In [18]:
# Testing Cell (Do NOT modify this cell)

#### Question 08 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4 Points)

Given the mixed type dictionary `mixed_data` below, create a pandas DataFrame from that data and assign it to a variable named `dataframe_8`.

---
<details>
  <summary><span style="color:blue">New pandas classes...</span></summary>
   Please refer to the <a href="https://pandas.pydata.org/docs/reference/index.html">API reference</a> to obtain details on any pandas classes that you have not yet encountered.
</details>
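
When a dictionary mixes scalars with array-like values, the `DataFrame` constructor repeats each scalar down every row. A minimal sketch with hypothetical data (the frame `toy` and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row example: one scalar, one array, one list
toy = pd.DataFrame({
    "label": "demo",                            # scalar, repeated on every row
    "value": np.array([1, 2], dtype="int32"),   # array-like, sets the row count
    "flag": [True, False],                      # list of the same length
})

print(toy.shape)               # (2, 3)
print(toy["label"].tolist())   # ['demo', 'demo']
```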

In [19]:
mixed_data = {
    "A": 42.0,
    "B": pd.Timestamp("20220704"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foundations of computer science",
}

In [20]:
# Write your solution here
dataframe_8 = pd.DataFrame(mixed_data)
dataframe_8

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 42.0 | 2022-07-04 | 1.0 | 3 | test | foundations of computer science |
| 1 | 42.0 | 2022-07-04 | 1.0 | 3 | train | foundations of computer science |
| 2 | 42.0 | 2022-07-04 | 1.0 | 3 | test | foundations of computer science |
| 3 | 42.0 | 2022-07-04 | 1.0 | 3 | train | foundations of computer science |


In [21]:
# Testing Cell (Do NOT modify this cell)

#### Question 09 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4 Points)

In Question 8, we created a DataFrame with 6 columns. Create a series called `series_9` with values that reflect the **data type** of each column in `dataframe_8` and an index consisting of the column labels. Set the `name` attribute of this series to `'Column Types'`.
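
As a hedged illustration of where column types live, the `dtypes` attribute of a DataFrame is itself a Series whose index holds the column labels. The frame `toy` below is hypothetical.

```python
import pandas as pd

# Hypothetical frame with one float column and one text column
toy = pd.DataFrame({"x": [1.5, 2.5], "y": ["a", "b"]})

types = toy.dtypes
print(type(types).__name__)   # Series
print(list(types.index))      # ['x', 'y']
print(types["x"])             # float64
```

Because the result is an ordinary Series, the usual Series methods and attributes (such as `name`) apply to it.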

In [22]:
# DataFrame.dtypes is already a Series indexed by the column labels
series_9 = dataframe_8.dtypes.rename("Column Types")
print(series_9)

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
Name: Column Types, dtype: object


In [23]:
# Testing Cell (Do NOT modify this cell)


#### Question 10 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Given a dictionary called `pricing` (below), create a DataFrame that contains only "object" and "colour" elements in that order and with column names of "object" and "colour". Assign this DataFrame to the variable `dataframe_10`.
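
One possible way to pick and order columns at construction time is the `columns` argument of `pd.DataFrame`, sketched here on a hypothetical dictionary called `stock`. Keys that are not listed are left out.

```python
import pandas as pd

# Hypothetical dictionary with three keys
stock = {
    "size": ["S", "M"],
    "item": ["hat", "scarf"],
    "cost": [9.5, 12.0],
}

# Only the listed keys are used, in the listed order
subset = pd.DataFrame(stock, columns=["item", "size"])
print(list(subset.columns))  # ['item', 'size']
```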

In [24]:
pricing = {'colour' : ['blue','green','yellow','red','white'],
        'object' : ['ball','pen','pencil','paper','mug'],
        'price' : [1.2,1.0,0.6,0.9,1.7]}
dataframe_10 = pd.DataFrame({'object': pricing['object'], 'colour': pricing['colour']})
dataframe_10

|   | object | colour |
|---|---|---|
| 0 | ball | blue |
| 1 | pen | green |
| 2 | pencil | yellow |
| 3 | paper | red |
| 4 | mug | white |


In [25]:
# Testing Cell (Do NOT modify this cell)


#### Question 11 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Create a DataFrame, `dataframe_11`, that is a copy of `dataframe_10` but with the index:
```python
  'one', 'two', 'three', 'four', 'five'
```
Change the colour of the mug from `'white'` to `'black'`. `dataframe_10` should not be changed.

In [26]:
dataframe_11 = dataframe_10.copy()
dataframe_11.index = ['one','two','three','four','five']
dataframe_11.loc['five','colour'] = 'black'
dataframe_11

|   | object | colour |
|---|---|---|
| one | ball | blue |
| two | pen | green |
| three | pencil | yellow |
| four | paper | red |
| five | mug | black |


In [27]:
# Testing Cell (Do NOT modify this cell)

#### Question 12 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Download the file `weather-utf8.csv` from the MyUni assignment page for this practical and co-locate the file with this notebook. This file contains weather station data from the Australian Bureau of Meteorology for dates in January 2022. Data includes temperature, wind and rainfall information.

Load the file into a DataFrame called `dataframe_12` and set the index for this dataframe to be the 'Date'.

---
<details>
  <summary><span style="color:blue">pandas API Reference for Input/Output</span></summary>
   You can find how to read and write a CSV file to a DataFrame here:
    <a href="https://pandas.pydata.org/docs/reference/io.html">API reference (Input/Output)</a>
</details>

---
<details>
  <summary><span style="color:blue">Help! My index won't stay put?!?</span></summary>
   You can find how to set the index of DataFrame "in place" here:
    <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html">API reference (set_index)</a>
</details>
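
The "index won't stay put" symptom can be reproduced on a tiny in-memory CSV. This is a sketch under assumptions: `csv_text` is hypothetical data standing in for a file on disk, and `io.StringIO` is used only so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "Date,Rainfall (mm)\n1/1/22,0.0\n2/1/22,1.6\n"

readings = pd.read_csv(io.StringIO(csv_text))

# set_index returns a new DataFrame; without reassignment nothing changes
readings.set_index('Date')
print(readings.index.name)   # None

# Reassigning the result (or passing inplace=True) makes the change stick
readings = readings.set_index('Date')
print(readings.index.name)   # Date

# read_csv can also set the index while loading
direct = pd.read_csv(io.StringIO(csv_text), index_col='Date')
```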

In [32]:
dataframe_12 = pd.read_csv("weather-utf8.csv")
dataframe_12.set_index('Date', inplace=True)
dataframe_12

| Date | Minimum temperature (C) | Maximum temperature (C) | Rainfall (mm) | Direction of maximum wind gust | Speed of maximum wind gust (km/h) | Time of maximum wind gust |
|---|---|---|---|---|---|---|
| 1/1/22 | 22.5 | 33.6 | 0.0 | SW | 30 | 15:39 |
| 2/1/22 | 19.3 | 30.6 | 0.0 | SW | 35 | 14:01 |
| 3/1/22 | 14.1 | 25.9 | 0.0 | S | 37 | 17:41 |
| 4/1/22 | 14.2 | 24.4 | 0.0 | SE | 41 | 12:00 |
| 5/1/22 | 14.4 | 21.5 | 0.0 | S | 37 | 13:23 |
| 6/1/22 | 15.8 | 22.3 | 0.0 | SW | 39 | 19:32 |
| 7/1/22 | 16.6 | 21.0 | 1.6 | SW | 43 | 15:40 |
| 8/1/22 | 16.0 | 24.4 | 1.6 | SW | 31 | 14:35 |
| 9/1/22 | 11.7 | 31.7 | 0.0 | WNW | 24 | 13:00 |
| 10/1/22 | 19.8 | 37.0 | 0.0 | E | 52 | 22:43 |


In [None]:
# Testing Cell (Do NOT modify this cell)


#### Question 13 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Using the data loaded in Question 12, create a pandas DataFrame called `dataframe_13` containing the maximum temperature and the rainfall only on those days when the maximum temperature for the day was greater than 34 degrees.
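
Row filtering and column selection can also be combined in a single `.loc` call. The sketch below uses a hypothetical frame `toy` with made-up column names rather than the weather data.

```python
import pandas as pd

# Hypothetical three-day frame
toy = pd.DataFrame(
    {
        "max_temp": [33.6, 37.0, 40.3],
        "rain": [0.0, 0.0, 1.2],
        "wind": [30, 52, 41],
    },
    index=["d1", "d2", "d3"],
)

# First argument: boolean row mask; second argument: the columns to keep
hot = toy.loc[toy["max_temp"] > 34, ["max_temp", "rain"]]
print(list(hot.index))  # ['d2', 'd3']
```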

In [33]:
# filter the rows based on temp > 34
filtered_rows = dataframe_12[dataframe_12['Maximum temperature (C)']>34.0]
# now, from the filtered rows select only the two columns needed
dataframe_13 = filtered_rows[['Maximum temperature (C)', 'Rainfall (mm)']]
dataframe_13

| Date | Maximum temperature (C) | Rainfall (mm) |
|---|---|---|
| 10/1/22 | 37.0 | 0.0 |
| 11/1/22 | 40.3 | 0.0 |
| 20/1/22 | 35.1 | 0.0 |
| 26/1/22 | 34.1 | 0.0 |
| 27/1/22 | 34.4 | 14.2 |
| 31/1/22 | 35.2 | 0.0 |


In [None]:
# Testing Cell (Do NOT modify this cell)

#### Question 14 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Using the data loaded in Question 12 and with reference to the [pandas API](https://pandas.pydata.org/docs/reference/index.html), create a two-value tuple called `mean_14` that holds the mean of the maximum temperature and the mean of the minimum temperature for the month of January, like:
```python
    (mean_max_temp, mean_min_temp)
```
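
A minimal sketch of packing two column means into a tuple, on a hypothetical two-row frame (`toy`, `high` and `low` are made-up names). Calling `mean()` on a selection of columns returns a Series with one entry per column.

```python
import pandas as pd

# Hypothetical temperatures for two days
toy = pd.DataFrame({"high": [30.0, 32.0], "low": [18.0, 20.0]})

# One mean per selected column, indexed by column label
means = toy[["high", "low"]].mean()

# Convert to plain Python floats and pack into a tuple
pair = (float(means["high"]), float(means["low"]))
print(pair)  # (31.0, 19.0)
```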

In [34]:
mean_max_temp = dataframe_12['Maximum temperature (C)'].mean()
mean_min_temp = dataframe_12['Minimum temperature (C)'].mean()
mean_14 = (mean_max_temp, mean_min_temp)
mean_14

(29.061290322580643, 18.583870967741937)

In [31]:
# Testing Cell (Do NOT modify this cell)

#### Question 15 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Create a DataFrame called `dataframe_15` by duplicating `dataframe_12`. To reflect questionable sensor data when the wind speed falls below 28 km/h, set these values (speed of maximum wind gust < 28 km/h) to the Not a Number value `np.nan`.
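
One way to blank out values by condition is `Series.mask`, which returns a new Series with NaN wherever the condition is True. The sketch below runs on hypothetical data (`toy`, `cleaned` and the column `gust` are illustrative). Because NaN is a float, the integer column comes back as `float64`.

```python
import numpy as np
import pandas as pd

# Hypothetical integer readings
toy = pd.DataFrame({"gust": [30, 24, 52]})

cleaned = toy.copy()

# mask() replaces entries where the condition is True (NaN by default)
cleaned["gust"] = cleaned["gust"].mask(cleaned["gust"] < 28)

print(cleaned["gust"].dtype)          # float64
print(cleaned["gust"].isna().sum())   # 1
print(toy["gust"].tolist())           # [30, 24, 52]
```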

In [35]:
dataframe_15 = dataframe_12.copy()
# Replace wind gust speeds below 28 km/h with NaN.
# Series.mask() returns the column with NaN wherever the condition is True.
gust_col = 'Speed of maximum wind gust (km/h)'
dataframe_15[gust_col] = dataframe_15[gust_col].mask(dataframe_15[gust_col] < 28)
dataframe_15

| Date | Minimum temperature (C) | Maximum temperature (C) | Rainfall (mm) | Direction of maximum wind gust | Speed of maximum wind gust (km/h) | Time of maximum wind gust |
|---|---|---|---|---|---|---|
| 1/1/22 | 22.5 | 33.6 | 0.0 | SW | 30.0 | 15:39 |
| 2/1/22 | 19.3 | 30.6 | 0.0 | SW | 35.0 | 14:01 |
| 3/1/22 | 14.1 | 25.9 | 0.0 | S | 37.0 | 17:41 |
| 4/1/22 | 14.2 | 24.4 | 0.0 | SE | 41.0 | 12:00 |
| 5/1/22 | 14.4 | 21.5 | 0.0 | S | 37.0 | 13:23 |
| 6/1/22 | 15.8 | 22.3 | 0.0 | SW | 39.0 | 19:32 |
| 7/1/22 | 16.6 | 21.0 | 1.6 | SW | 43.0 | 15:40 |
| 8/1/22 | 16.0 | 24.4 | 1.6 | SW | 31.0 | 14:35 |
| 9/1/22 | 11.7 | 31.7 | 0.0 | WNW | NaN | 13:00 |
| 10/1/22 | 19.8 | 37.0 | 0.0 | E | 52.0 | 22:43 |


In [None]:
# Testing Cell (Do NOT modify this cell)


#### Question 16 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5 Points)

Calculate the number of times that the maximum wind gust came from the direction 'NNW' and assign it to the variable `nnw_gust_16`.
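
Counting matches can be sketched in two ways on a hypothetical Series of directions: summing a boolean comparison, or looking up a label in `value_counts()`. The names `directions`, `sw_count` and `counts` are illustrative.

```python
import pandas as pd

# Hypothetical wind directions
directions = pd.Series(["SW", "S", "SW", "E"])

# True counts as 1 and False as 0 when summed
sw_count = int((directions == "SW").sum())

# value_counts() tallies every distinct label at once
counts = directions.value_counts()

print(sw_count)               # 2
print(counts.get("NNW", 0))   # 0
```

Using `.get(label, 0)` avoids a `KeyError` when a direction never occurs in the data.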

In [38]:
# Write your solution here
nnw_gust_16 = len(dataframe_15[dataframe_15['Direction of maximum wind gust'] == 'NNW'])
nnw_gust_16

0

In [None]:
# Testing Cell (Do NOT modify this cell)

#### Question 17 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(10 Points)

Download the file `titanic.csv` from the MyUni assignment page for this practical and co-locate the file with this notebook. This file contains data on the passengers on the Titanic, including an indication of who survived and who died when it sank on 15 April 1912. Most columns are self-explanatory except `Pclass`, `SibSp` and `Parch`.

Load the CSV file into a DataFrame called `dataframe_17` and rename the column `'Pclass'` to `'Passenger Class'`. Also, delete the columns `SibSp` and `Parch` from the DataFrame. Refer to the [API Reference](https://pandas.pydata.org/docs/reference/index.html) for information on renaming and dropping columns if required, and recall from Question 12 that these methods return a new DataFrame by default: either assign the result back to the variable or pass `inplace=True` for the changes to persist.
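
A minimal sketch of renaming and dropping columns on a hypothetical two-row frame (the values are made up; only the column names echo the question). Both `rename` and `drop` return new objects, so the calls can be chained and the result assigned once.

```python
import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"Pclass": [3, 1], "SibSp": [1, 0], "Fare": [7.25, 71.28]})

# rename maps old labels to new ones; drop removes the listed columns
tidy = toy.rename(columns={"Pclass": "Passenger Class"}).drop(columns=["SibSp"])

print(list(tidy.columns))  # ['Passenger Class', 'Fare']
print(list(toy.columns))   # ['Pclass', 'SibSp', 'Fare']
```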

In [46]:
dataframe_17 = pd.read_csv("titanic.csv")
dataframe_17 = dataframe_17.rename(columns={'Pclass': 'Passenger Class'})
dataframe_17 = dataframe_17.drop(columns=['SibSp', 'Parch'])  # drop both columns in one call
dataframe_17

|   | PassengerId | Survived | Passenger Class | Name | Sex | Age | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 111369 | 30.0000 | C148 | C |


In [None]:
# Testing Cell (Do NOT modify this cell)

#### Question 18 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(10 Points)

The column `'Survived'` uses 1 and 0 to indicate whether the passenger survived. Create a `dataframe_18` which is a copy of `dataframe_17` and replaces 1 and 0 in the `'Survived'` column with the Python values `True` and `False`.
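
As a sketch of one way to translate coded values, `Series.map` with a dictionary converts every entry in a single pass. The Series `flags` is hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 flags
flags = pd.Series([0, 1, 1])

# Each value is looked up in the dictionary
as_bool = flags.map({1: True, 0: False})

print(as_bool.tolist())  # [False, True, True]
```

Any value missing from the dictionary would become NaN, so this approach assumes the column contains only the listed codes.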

In [47]:
# Write your solution here
dataframe_18 = dataframe_17.copy()
# Map 1 -> True and 0 -> False in a single pass
dataframe_18['Survived'] = dataframe_18['Survived'].map({1: True, 0: False})
dataframe_18

|   | PassengerId | Survived | Passenger Class | Name | Sex | Age | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | False | 3 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | True | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | True | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | True | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | False | 3 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | False | 2 | Montvila, Rev. Juozas | male | 27.0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | True | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | False | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | True | 1 | Behr, Mr. Karl Howell | male | 26.0 | 111369 | 30.0000 | C148 | C |


In [None]:
# Testing Cell (Do NOT modify this cell)

#### Question 19 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(15 Points)

To end this practical, work out what percentage of female and male passengers survived. Present these two values in a variable called `survivors_19` that is a tuple `(female_percent, male_percent)` that expresses the percentage of survivors rounded to one decimal place like:
```python
    (72.1, 25.9)  # values are fictional but show the expected format of the result
```
To determine this value, explore the pandas API for the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function.
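
A small sketch of the groupby idea on hypothetical data: because `True` counts as 1 and `False` as 0, the mean of a boolean column within each group is the fraction of `True` values. The frame `toy` and its columns `group` and `passed` are made up for illustration.

```python
import pandas as pd

# Hypothetical outcomes for two groups
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y", "y"],
    "passed": [True, False, True, True, True, False],
})

# Per-group mean of a boolean column, scaled to a percentage
rate = toy.groupby("group")["passed"].mean() * 100

x_percent = round(float(rate["x"]), 1)
y_percent = round(float(rate["y"]), 1)
print((x_percent, y_percent))  # (50.0, 75.0)
```

Indexing the grouped result by its group label (`rate["x"]`) returns a single scalar, which can then be rounded and placed in a tuple.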

In [48]:
survival_rate = dataframe_18.groupby('Sex')['Survived'].mean()*100
female_percent = round(survival_rate['female'], 1)
male_percent = round(survival_rate['male'], 1)
survivors_19 = (female_percent, male_percent)
survivors_19

(74.2, 18.9)

In [None]:
# Testing Cell (Do NOT modify this cell)