<div class="alert alert-block alert-info" style="background-color: #301E40; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<br/><br/>
<h1 style="font-size: 45px; color: white; align: center;"><center>
<img src="https://raw.githubusercontent.com/HumbleData/beginners-data-workshop/master/media/humble-data-logo-white-transparent.png" width="250px" /><br/><br/>
Data Analysis with Pandas
</center></h1>
</div>

> ***Note***: This notebook contains solution cells with ***a*** solution. Remember there is not only one solution to a problem!  
> 
> You will recognise these cells as they start with **# %**.  
> 
> If you would like to see the solution, you will have to remove the **#** (which can be done by using **Ctrl** and **?**) and run the cell. If you want to run the solution code, you will have to run the cell again.

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Data analysis packages
</h2><br>
</div>

Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:

| Package | Description |
| -- | -- |
| `NumPy` | Numerical calculations - does all the heavy lifting by passing out to C subroutines. This means you get _both_ the productivity of Python, _and_ the computational power of C. Best of both worlds! |
| `SciPy` | Scientific computing, statistical tests, and much more! |
| `pandas` | Your data manipulation swiss army knife. You'll likely see pandas used in any PyData demo! pandas is built on top of NumPy, so it's **fast**. |
| `matplotlib` | An old but powerful data visualisation package, inspired by Matlab. |
| `Seaborn` | A newer and easy-to-use but limited data visualisation package, built on top of matplotlib. |
| `scikit-learn` | Your one-stop machine learning shop! Classification, regression, clustering, dimensionality reduction and more. |
| `nltk` and `spacy` | nltk = natural language processing toolkit; spacy is a newer package for natural language processing but very easy to use. |
| `statsmodels` | Statistical tests, time series forecasting and more. The "model formula" interface will be familiar to R users. |
| `requests` and `Beautiful Soup` | `requests` + `Beautiful Soup` = great combination for building web scrapers. |
| `Jupyter` | Jupyter itself is a package too. See the latest version at https://pypi.org/project/jupyter/, and upgrade with e.g. `conda install jupyter==1.0.0` |

Countless other packages are available as well.

Today we'll focus on the library that accounts for the bulk of our day-to-day work: `pandas`. Pandas is built on top of the speed and power of NumPy.
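As a rough illustration of what "built on top of NumPy" means in practice, the sketch below uses a small made-up column (the name `toy` and its values are assumptions, not the workshop data). It shows that a numeric pandas column can be handed to NumPy and that arithmetic applies to the whole column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame({"mass_g": [3750.0, 3800.0, 3250.0]})

# A numeric column can be converted to a NumPy array.
values = toy["mass_g"].to_numpy()
print(type(values))

# Arithmetic is vectorised: it applies to every element without a Python loop.
mass_kg = toy["mass_g"] / 1000
print(mass_kg.tolist())

# NumPy functions accept pandas columns directly.
mean_mass = np.mean(toy["mass_g"])
print(mean_mass)
```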

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Imports
</h2><br>
</div>

In [1]:
import pandas as pd

>Import numpy using the convention seen at the end of the first notebook.

In [2]:
import numpy as np

In [None]:
# %load ../solutions/02_01.py

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Loading the data
</h2><br>
</div>

To see a method's documentation, you can use the help function. In Jupyter, you can also just put a question mark before the method.

In [5]:
?pd.read_csv

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None'

To load the dataframe we are using in this notebook, we will provide the path to the file: ../data/Penguins/penguins.csv

>Read the file into a pandas DataFrame and assign it to df

In [6]:
df = pd.read_csv('../data/Penguins/penguins.csv')

In [7]:
# %load ../solutions/02_02.py
df = pd.read_csv("../data/Penguins/penguins.csv")
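If the penguins file is not at hand, the same call can be tried on text held in memory. This is a minimal sketch: the CSV text and the name `toy` are made up for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Gentoo,Biscoe,5200
Chinstrap,Dream,
"""

# read_csv accepts a path or any file-like object, such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)
# The empty field in the last row is read as a missing value (NaN).
print(toy["body_mass_g"].isnull().sum())
```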

**To have a look at the first 5 rows of df, we can use the *head* method.**

In [8]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,,,,,,,


>Have a look at the last 3 rows of df using the tail method

In [12]:
df.tail(3)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
352,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
353,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
354,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,Male


In [13]:
# %load ../solutions/02_03.py
df.tail(3)

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
General information about the dataset
</h2><br>
</div>

**To get the size of the dataset, we can use the *shape* attribute.**  
The first number is the number of rows, the second one the number of columns.

>Show the shape of df (do not put brackets at the end)

In [15]:
df.shape

(355, 7)

In [None]:
# %load ../solutions/02_04.py

>Get the names of the columns and info about them (number of non-null values and type) using the info method.

In [18]:
dir(df)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__arrow_c_stream__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__dataframe__',
 '__dataframe_consortium_standard__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 '__r

In [17]:
# %load ../solutions/02_05.py
df.info()

>Get the columns of the dataframe using the columns attribute.

In [20]:
df.columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

In [21]:
# %load ../solutions/02_06.py
df.columns
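The attributes and methods from this section can be tried together on a tiny frame. The sketch below uses invented data (the name `toy` and its values are assumptions) and shows one way to unpack `shape` and turn `columns` into a plain list.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "body_mass_g": [3750.0, 5200.0]}
)

# shape is a (rows, columns) tuple, so it can be unpacked.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows.
print(len(toy))

# columns is an Index object; list() turns it into a plain list of names.
column_names = list(toy.columns)
print(column_names)

# info() prints column names, non-null counts and dtypes.
toy.info()
```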

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Display settings
</h2><br>
</div>

We can check the notebook's display options, for example the maximum number of rows pandas will show.

In [24]:
pd.options.display.max_rows

60

>Force pandas to display 25 rows by changing the value of the above.

In [25]:
?pd.options.display.rows

Object `pd.options.display.rows` not found.


In [26]:
# %load ../solutions/02_07.py
pd.options.display.max_rows = 25
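Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. As a sketch of an alternative, pandas also provides `pd.get_option` and `pd.option_context`; the latter changes an option only inside a `with` block. The variable names here are illustrative.

```python
import pandas as pd

# Remember the current setting so we can compare afterwards.
original = pd.get_option("display.max_rows")

# Temporarily change the option; the previous value returns when the block ends.
with pd.option_context("display.max_rows", 25):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(inside, after == original)
```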

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Subsetting data
</h2><br>
</div>

We can subset a dataframe by label, by integer position, or by a combination of both.  
There are different ways to do it, using .loc, .iloc and also [].  
See the [documentation](https://pandas.pydata.org/pandas-docs/stable/indexing.html).

>Display the 'bill_length_mm' column

In [36]:
df['bill_length_mm']

0      39.1
1      39.5
2      40.3
3       NaN
4       NaN
       ... 
350    46.8
351    50.4
352    45.2
353    45.2
354    49.9
Name: bill_length_mm, Length: 355, dtype: float64

In [29]:
# %load ../solutions/02_08.py
df["bill_length_mm"]

*Note:* We could also use `df.bill_length_mm`, but it's not the greatest idea: a column name can clash with an existing DataFrame method or attribute, and this syntax does not work for column names that contain spaces.
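The pitfall with attribute access can be seen on a small made-up frame (the column names `count` and `body mass` are chosen purely to trigger the problem and are not part of the penguins data):

```python
import pandas as pd

# Hypothetical toy data with deliberately awkward column names.
toy = pd.DataFrame({"count": [1, 2, 3], "body mass": [3750, 3800, 3250]})

# Attribute access finds the DataFrame method, not the column named "count".
print(callable(toy.count))

# Bracket access always returns the column.
print(toy["count"].tolist())

# A name containing a space can only be reached with brackets.
print(toy["body mass"].tolist())
```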

>Have a look at the 12th observation:

In [47]:
# using .iloc (uses positions, "i" stands for integer)
df.iloc[11]

species                 Adelie
island               Torgersen
bill_length_mm            37.8
bill_depth_mm             17.1
flipper_length_mm        186.0
body_mass_g             3300.0
sex                        NaN
Name: 11, dtype: object

In [39]:
# %load ../solutions/02_09.py
df.iloc[11]

In [40]:
# using .loc (uses indexes and labels)
?df.loc

Type:        property
String form: <property object at 0x10cca1260>
Docstring:  
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See more at :

In [48]:
# %load ../solutions/02_10.py
df.loc[11]

species                 Adelie
island               Torgersen
bill_length_mm            37.8
bill_depth_mm             17.1
flipper_length_mm        186.0
body_mass_g             3300.0
sex                        NaN
Name: 11, dtype: object

>Display the **bill_length_mm** of the last three observations.

In [49]:
# using .iloc
?df.iloc

Type:        property
String form: <property object at 0x10cca11c0>
Docstring:  
Purely integer-location based indexing for selection by position.

.. deprecated:: 2.2.0

   Returning a tuple from a callable is deprecated.

``.iloc[]`` is primarily integer position based (from ``0`` to
``length-1`` of the axis), but may also be used with a boolean
array.

Allowed inputs are:

- An integer, e.g. ``5``.
- A list or array of integers, e.g. ``[4, 3, 0]``.
- A slice object with ints, e.g. ``1:7``.
- A boolean array.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above).
  This is useful in method chains, when you don't have a reference to the
  calling object, but would like to base your selection on
  some value.
- A tuple of row and column indexes. The tuple elements consist of one of the
  above inputs, e.g. ``(0, 1)``.

``.iloc`` will raise ``IndexError`` if a request

In [50]:
# %load ../solutions/02_11.py
df.iloc[-3:, 2]

In [51]:
# using .loc
?df.loc

Type:        property
String form: <property object at 0x10cca1260>
Docstring:  
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See more at :

In [None]:
# %load ../solutions/02_12.py

And finally, look at the **flipper_length_mm** and **body_mass_g** of the 146th, the 8th and the 1st observations:

In [56]:
# using .iloc
df.iloc

<pandas.core.indexing._iLocIndexer at 0x11fa8b020>

In [54]:
# %load ../solutions/02_13.py
df.iloc[[145, 7, 0], [4, -2]]

In [59]:
# using .loc
df.loc['flipper_length_mm':, :'body_mass_g']

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g


In [60]:
# %load ../solutions/02_14.py
df.loc[[145, 7, 0], ["flipper_length_mm", "body_mass_g"]]

**!!WARNING!!**  Unlike standard Python slicing and ``.iloc``, a slice passed to ``.loc`` **includes** its end label.

In [61]:
df.iloc[5:10]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
6,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
7,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
8,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
9,Adelie,Torgersen,34.1,18.1,193.0,3475.0,


In [62]:
df.loc[5:10]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
6,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
7,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
8,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
9,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
10,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Filtering data on conditions
</h2><br>
</div>

**We can also use condition(s) to filter.**  
We want to display the rows of df where **body_mass_g** is greater than 4000. We will start by creating a mask with this condition.

In [63]:
mask_PW = df['body_mass_g'] > 4000
mask_PW

0      False
1      False
2      False
3      False
4      False
       ...  
350     True
351     True
352     True
353     True
354     True
Name: body_mass_g, Length: 355, dtype: bool

Note that this returns booleans. If we pass this mask to our dataframe, it will display only the rows where the mask is True.

In [64]:
df[mask_PW]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
8,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male
10,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
15,Adelie,Torgersen,34.6,21.1,198.0,4400.0,Male
18,Adelie,Torgersen,42.5,20.7,197.0,4500.0,Male
20,Adelie,Torgersen,46.0,21.5,194.0,4200.0,Male
...,...,...,...,...,...,...,...
350,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
351,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
352,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
353,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


>Display the rows of df where **body_mass_g** is greater than 4000 and **flipper_length_mm** is less than 185.

In [69]:
bmg = df['body_mass_g'] > 4000  
flm = df['flipper_length_mm'] < 185

df[bmg and flm]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [70]:
# %load ../solutions/02_15.py
mask_PW_PL = (df["body_mass_g"] > 4000) & (df["flipper_length_mm"] < 185)
df[mask_PW_PL]
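Python's `and` tries to reduce each Series to a single True or False, which is ambiguous and raises a `ValueError`. The element-wise operators are `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` and `<`. A minimal sketch on invented data (the name `toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {
        "body_mass_g": [3750, 4300, 4675, 5200],
        "flipper_length_mm": [181, 184, 195, 212],
    }
)

heavy = toy["body_mass_g"] > 4000
short = toy["flipper_length_mm"] < 185

both = toy[heavy & short]      # rows where both conditions hold
either = toy[heavy | short]    # rows where at least one condition holds
not_heavy = toy[~heavy]        # rows where the first condition does not hold

print(both.index.tolist(), either.index.tolist(), not_heavy.index.tolist())
```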

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Values
</h2><br>
</div>

We can get the number of unique values from a certain column by using the `nunique` method.

For example, we can get the number of unique values from the species column:

In [71]:
df['species'].nunique()

3

We can also get the list of unique values from a certain column by using the `unique` method.
>Return the list of unique values from the species column

In [72]:
c = df.nunique()
df = df[c]

KeyError: "None of [Index([3, 3, 164, 80, 55, 94, 2], dtype='int64')] are in the [columns]"

In [73]:
# %load ../solutions/02_16.py
df["species"].unique()
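One detail worth knowing is how the two methods treat missing values: `nunique` leaves them out by default, while `unique` includes them in its result. A short sketch on a made-up Series (the name `toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing one missing value.
toy = pd.Series(["Adelie", "Gentoo", "Adelie", np.nan], name="species")

# By default, missing values are not counted.
n_without_nan = toy.nunique()

# dropna=False counts the missing value as one more distinct value.
n_with_nan = toy.nunique(dropna=False)

# unique() returns the distinct values, including the missing one.
distinct = toy.unique()

print(n_without_nan, n_with_nan, len(distinct))
```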

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Null Values and NaN
</h2><br>
</div>

When you work with data, you will quickly learn that data is never "clean": some values are often missing. Missing values are usually referred to as null values. In computation, it is best practice to represent them with a special value, "**N**ot **a** **N**umber", also called NaN.

We can use the `isnull` method to know if a value is null or not. It returns boolean values.

In [74]:
df['flipper_length_mm'].isnull()

0      False
1      False
2      False
3       True
4       True
       ...  
350    False
351    False
352    False
353    False
354    False
Name: flipper_length_mm, Length: 355, dtype: bool

**We can apply different methods one after the other.**  
For example, we could apply the method `sum` after the method `isnull` to know the number of null observations in the **flipper_length_mm** column.
>Get the total number of null values for **flipper_length_mm**.

In [75]:
df['flipper_length_mm'].isnull().sum()

9

In [76]:
# %load ../solutions/02_17.py
df["flipper_length_mm"].isnull().sum()
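Summing works here because each `True` counts as 1 and each `False` as 0. The same chain can be applied to a whole DataFrame to get one count per column. A minimal sketch on invented data (the name `toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values.
toy = pd.DataFrame(
    {
        "flipper_length_mm": [181.0, np.nan, 195.0, np.nan],
        "sex": ["Male", "Female", None, "Male"],
    }
)

# On a single column: one number.
n_missing_flipper = toy["flipper_length_mm"].isnull().sum()

# On the whole DataFrame: one count per column.
nulls_per_column = toy.isnull().sum()

print(n_missing_flipper)
print(nulls_per_column)
```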

To get the count of the different values of a column, we can use the `value_counts` method.

For example, for the species column:

In [81]:
df['species'].value_counts(dropna=False)

species
Adelie       153
Gentoo       126
Chinstrap     69
NaN            7
Name: count, dtype: int64

If we want to know the count of NaN values, we have to pass the value `False` to the parameter **dropna** (set to `True` by default).
> Return the count for each sex, including the NaN values.

In [83]:
df['sex'].value_counts(normalize=True)

sex
Male      0.501484
Female    0.498516
Name: proportion, dtype: float64

In [84]:
# %load ../solutions/02_18.py
df["sex"].value_counts(dropna=False)

To get the proportion instead of the count of these values, we have to pass the value `True` to the parameter **normalize**.
>Return the proportion for each species.

In [85]:
# %load ../solutions/02_19.py
df["species"].value_counts(normalize=True)
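The two parameters can also be combined. In the sketch below (made-up data; the name `toy` is an assumption), `normalize=True` on its own gives proportions among the known values only, while adding `dropna=False` gives proportions over all rows, including the missing ones.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.Series(["Male", "Female", "Male", None], name="sex")

counts = toy.value_counts(dropna=False)                         # counts, NaN included
shares_known = toy.value_counts(normalize=True)                 # proportions among known values
shares_all = toy.value_counts(normalize=True, dropna=False)     # proportions over all rows

print(counts)
print(shares_known)
print(shares_all)
```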

>Using the index attribute, get the indexes of the observations without **flipper_length_mm**

In [86]:
df['flipper_length_mm'].value_counts()

flipper_length_mm
190.0    22
195.0    17
187.0    17
193.0    16
210.0    14
191.0    13
215.0    12
197.0    10
196.0    10
220.0     9
185.0     9
212.0     8
198.0     8
208.0     8
216.0     8
186.0     7
181.0     7
189.0     7
230.0     7
192.0     7
184.0     7
199.0     6
213.0     6
188.0     6
214.0     6
217.0     6
222.0     6
201.0     6
219.0     5
209.0     5
218.0     5
221.0     5
203.0     5
194.0     5
180.0     5
178.0     4
225.0     4
228.0     4
202.0     4
200.0     4
182.0     3
224.0     3
205.0     3
229.0     2
183.0     2
207.0     2
223.0     2
211.0     2
231.0     1
206.0     1
174.0     1
172.0     1
179.0     1
176.0     1
226.0     1
Name: count, dtype: int64

In [88]:
# %load ../solutions/02_20.py
df[df["flipper_length_mm"].isnull()].index

Index([3, 4, 59, 101, 131, 185, 239, 339, 349], dtype='int64')

Use the **[dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)** method to remove the rows that contain only NaN values.
>Get the help for the dropna method.

In [89]:
df[df["flipper_length_mm"].isnull()].index.value_counts(dropna=True)

3      1
4      1
59     1
101    1
131    1
185    1
239    1
339    1
349    1
Name: count, dtype: int64

In [90]:
# %load ../solutions/02_21.py
?pd.DataFrame.dropna

>Use the dropna method to remove the rows of `df` where all of the values are NaN, and assign the result to `df_2`.

In [97]:
df_2 = df.dropna(how="all")

In [98]:
# %load ../solutions/02_22.py
df_2 = df.dropna(how="all")

We can use an f-string to format a string. We have to write an `f` before the opening quotation mark, and put whatever we want to format between curly brackets.

In [99]:
print(f'shape of df: {df.shape}')

shape of df: (355, 7)
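Curly brackets can hold any expression, and an optional format specification after a colon controls how the value is displayed. A small sketch; the numbers and variable names below are hypothetical values chosen for illustration:

```python
# Hypothetical values, chosen for illustration only.
n_rows, n_cols = 355, 7
missing = 7

# Several values in one string.
message = f"{n_rows} rows and {n_cols} columns"

# An expression inside the brackets, with a thousands separator.
cells = f"{n_rows * n_cols:,} cells"

# A ratio shown as a percentage with one decimal place.
share = f"{missing / n_rows:.1%} missing"

print(message)
print(cells)
print(share)
```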


> Print the number of rows of `df_2` using an f-string. Did we lose any rows between `df` and `df_2`? If not, why not?

In [104]:
print(df_2.shape[0])

348


In [103]:
# %load ../solutions/02_23.py
print(f"number of rows of df_2: {df_2.shape[0]}")

>Use the dropna method to remove the rows of `df_2` which contain any NaN values, and assign the result to `df_3`

In [105]:
# %load ../solutions/02_24.py
df_3 = df_2.dropna(how="any")
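The difference between `how="all"` and `how="any"` can be checked on a few rows. The sketch below uses invented data (the name `toy` and its values are assumptions) and also shows the `subset` parameter, which limits the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely missing, row 2 is partly missing.
toy = pd.DataFrame(
    {
        "species": ["Adelie", None, "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, np.nan, np.nan, 5200.0],
    }
)

no_empty_rows = toy.dropna(how="all")            # drops only rows where every value is missing
complete_rows = toy.dropna(how="any")            # drops rows with at least one missing value
has_species = toy.dropna(subset=["species"])     # only looks at the species column

print(no_empty_rows.index.tolist())
print(complete_rows.index.tolist())
print(has_species.index.tolist())
```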

>Print the number of rows of `df_3` using an f-string.

In [106]:
print(f'rows in df 3: {df_3.shape[0]}')

rows in df 3: 337


In [None]:
# %load ../solutions/02_25.py

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Duplicates
</h2><br>
</div>

>Remove the duplicate rows from `df_3`, and assign the new dataframe to `df_4`

In [108]:
# %load ../solutions/02_26.py
df_4 = df_3.drop_duplicates()

In [109]:
# checking the shape of df_4
df_4.shape

(333, 7)

You should see that 4 rows have been dropped. 
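Before dropping duplicates, it can be useful to look at them. The `duplicated` method returns a boolean mask that works like the masks used for filtering earlier. A minimal sketch on invented data (the name `toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical.
toy = pd.DataFrame(
    {
        "species": ["Gentoo", "Gentoo", "Adelie"],
        "body_mass_g": [5200.0, 5200.0, 3750.0],
    }
)

# By default, the first occurrence is kept and later repeats are marked True.
dup_mask = toy.duplicated()
n_duplicates = dup_mask.sum()

# keep=False marks every copy, which helps when inspecting the duplicates.
all_copies = toy[toy.duplicated(keep=False)]

deduped = toy.drop_duplicates()

print(dup_mask.tolist(), n_duplicates, len(all_copies), len(deduped))
```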

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Some stats
</h2><br>
</div>

>Use the describe method to see how the data is distributed (numerical features only!)

In [111]:
# %load ../solutions/02_27.py
df_4.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057
std,5.468668,1.969235,14.015765,805.215802
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.6,18.7,213.0,4775.0
max,59.6,21.5,231.0,6300.0


We can also convert the **species** column to the `category` dtype to save memory. Note: You may receive a **SettingWithCopyWarning** - you can safely ignore this warning for this notebook.

In [112]:
df_4['species'] = df_4['species'].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_4['species'] = df_4['species'].astype('category')


>Using the dtypes attribute, check the types of the columns of `df_4`

In [114]:
# %load ../solutions/02_28.py
df_4.dtypes

species              category
island                 object
bill_length_mm        float64
bill_depth_mm         float64
flipper_length_mm     float64
body_mass_g           float64
sex                    object
dtype: object
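The memory saving from the `category` dtype can be measured with `memory_usage(deep=True)`. The sketch below uses a made-up column of repeated names (the names `toy` and `subset` and the data are assumptions). It also shows one common way to avoid the SettingWithCopyWarning: taking an explicit `.copy()` before assigning to a column of a filtered frame.

```python
import pandas as pd

# Hypothetical toy data: three names repeated many times.
toy = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap"] * 1000})

# Compare the memory used by the text column and by its category version.
as_text = toy["species"].memory_usage(deep=True)
as_category = toy["species"].astype("category").memory_usage(deep=True)
print(as_text, as_category)

# An explicit copy makes it clear that we are modifying a new object.
subset = toy[toy["species"] != "Chinstrap"].copy()
subset["species"] = subset["species"].astype("category")
print(subset["species"].dtype)
```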

We can also use the methods count(), mean(), sum(), median(), std(), min() and max() separately if we are only interested in one of those.

>Get the minimum for each numerical column of `df_4`

In [115]:
# %load ../solutions/02_29.py
df_4.min(numeric_only=True)

>Calculate the maximum of the **flipper_length_mm**

In [116]:
# %load ../solutions/02_30.py
df_4["flipper_length_mm"].max()

We can also get information for each species using the `groupby` method.


> Get the median for each **species**.

In [117]:
# %load ../solutions/02_31.py
df_4.groupby("species").median(numeric_only=True)
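A grouped object can also compute several statistics at once with `agg`. This is a sketch on invented data (the name `toy` and its values are assumptions), selecting one column before aggregating:

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
        "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0, 5700.0],
    }
)

# One statistic per group.
medians = toy.groupby("species")["body_mass_g"].median()

# Several statistics per group, one column per statistic.
summary = toy.groupby("species")["body_mass_g"].agg(["count", "median", "max"])

print(medians)
print(summary)
```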

---

<div class="alert alert-block alert-warning" style="padding: 0px; padding-left: 20px; padding-top: 5px;"><h2 style="color: #301E40">
Saving the dataframe as a csv file
</h2><br>
</div>

>Save df_4 using this path: `'../data/Penguins/my_penguins.csv'`

In [119]:
# %load ../solutions/02_32.py
df_4.to_csv("../data/Penguins/my_penguins.csv")
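By default, `to_csv` also writes the index as a first, unnamed column, which shows up as `Unnamed: 0` when the file is read back. Passing `index=False` leaves it out. The sketch below uses invented data and writes to an in-memory buffer instead of a file (the names `toy` and `buffer` are assumptions):

```python
import io

import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "body_mass_g": [3750.0, 5200.0]}
)

# Without a path, to_csv returns the CSV text; the index is the first column.
with_index = toy.to_csv()
print(with_index.splitlines()[0])

# index=False leaves the index out, so a round trip keeps the same columns.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```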