# Building a Possum Regression and Classification Model
*By Stephen FitzSimon*

In [1]:
import pandas as pd
import acquire



## Contents <a name='contents'></a>

1. <a href='#introduction'>Introduction</a>
2. <a href='#acquire'>Acquire the Data</a>
3. <a href='#explore'>Explore the Data</a>
4. <a href='#model'>Model the Data</a>
5. <a href='#conclusion'>Conclusion</a>

<img src='https://upload.wikimedia.org/wikipedia/commons/e/e9/Trichosurus_caninus_Gould.jpg'></img>

## Introduction <a name='introduction'></a>

1. <a href='#sources'>Sources</a>
2. <a href='#about_data'>About The Data</a>

The brushtail possum is a native Australian possum found along the east coast of the continent.  The following data was published in 1995 by Lindenmayer et al.; at that time the species was classified together with the closely related <a href='https://en.wikipedia.org/wiki/Mountain_brushtail_possum'>Mountain Brushtail Possum (*Trichosurus cunninghami*)</a>.  As members of the genus *Trichosurus*, they are considered more at home on the ground than other members of the Phalangeridae family, yet they remain predominantly leaf eaters.

The goal of this project is to explore the anatomical characteristics of the species and develop a linear regression model to predict an individual's age, and a classification model to predict an individual's sex. 

#### More Information On The Species and The Phalangeridae Family

- <a href='https://en.wikipedia.org/wiki/Short-eared_possum'>Species information on Wikipedia</a>

- <a href='https://en.wikipedia.org/wiki/Mountain_brushtail_possum'>Wikipedia information on the closely related Mountain brushtail possum</a> (note: before 2002 the two species were thought to be a single species)

- <a href='https://en.wikipedia.org/wiki/Phalangeridae'>Wikipedia information on the Phalangeridae family</a>

- <a href='https://www.theage.com.au/national/a-tail-of-two-possums-20041203-gdz4bq.html'>A tail of two possums - The Age (Melbourne)</a>

- <a href='https://www.iucnredlist.org/species/40557/21951945'>Conservation information at Red List</a>

- <a href='https://www.departments.bucknell.edu/biology/resources/msw3/browse.asp?s=y&id=11000086'>Entry at Mammal Species of the World</a>

- <a href='https://www.youtube.com/watch?v=Cwg2rTorJWc'>Video by Brave Wilderness on the Related Brushtail Possum</a>

### Sources <a name='sources'></a>

*Original Paper*

Lindenmayer DB, Viggers KL, Cunningham RB, Donnelly CF (1995) Morphological Variation Among Populations of the Mountain Brushtail Possum, Trichosurus-Caninus Ogilby (Phalangeridae, Marsupialia). *Australian Journal of Zoology* 43, 449-458. https://doi.org/10.1071/ZO9950449

*Kaggle Dataset*

https://www.kaggle.com/datasets/abrambeyer/openintro-possum

*Australian Bureau of Statistics Digital Boundary Files*

https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### About the Data <a name='about_data'></a>

*Note: the original column names can be found on the Kaggle page for the data.  The data dictionary below uses the column names assigned in the `acquire.py` module.  The information used to clean the data comes from either the original paper by Lindenmayer or the documentation for the <a href='https://cran.r-project.org/web/packages/DAAG/index.html'>DAAG dataset on CRAN</a>.*

- `case` : observation/identification number of individual
- `trap_site` : the site where the individual was trapped (an id number in the original data, a site name in the prepared data); the sites are as follows:
    - Cambarville, Victoria
    - Bellbird, Victoria
    - Whian Whian State Forest, NSW
    - Byrangery Reserve, NSW
    - Conondale Ranges, Queensland
    - Bulburin State Forest, Queensland
    - Allyn River Forest Park, NSW
- `state` : the Australian state of the `trap_site` location 
- `sex` : the sex of the individual
- `age` : the age of the individual in years, determined by tooth wear (Lindenmayer)
- `head_length` : length of the head from the nose tip to the external occipital protuberance in mm
- `skull_width` : the width of the skull at the widest part in mm
- `total_length` : length of the body from the nose tip to the tail end in mm
- `tail_length` : length from tail base to tail tip in mm
- `foot_length` : length from heel to longest toe's tip in mm
- `ear_length` : length from the base of the ear to the tip of the ear in mm
- `eye_width` : the width of the eye from medial to lateral canthus in mm
- `chest_girth` : girth behind the forelimbs in mm
- `belly_girth` : girth behind the last rib in mm
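- `latitude` : latitude of the `trap_site` location in decimal degrees (negative values are south of the equator)
- `longitude` : longitude of the `trap_site` location in decimal degrees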
- `elevation` : height above sea level in meters
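
The link between `trap_site` and `state` in the dictionary above can be written as a simple lookup. The following is a minimal sketch and not the code in `acquire.py`: the `SITE_TO_STATE` mapping and the `add_state` helper are hypothetical names introduced for illustration, and the state names are spelled out in full as an assumption.

```python
import pandas as pd

# Hypothetical lookup built from the site list above (NSW = New South Wales).
SITE_TO_STATE = {
    "Cambarville": "Victoria",
    "Bellbird": "Victoria",
    "Whian Whian State Forest": "New South Wales",
    "Byrangery Reserve": "New South Wales",
    "Conondale Ranges": "Queensland",
    "Bulburin State Forest": "Queensland",
    "Allyn River Forest Park": "New South Wales",
}


def add_state(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a `state` column derived from `trap_site`."""
    out = df.copy()
    # Sites missing from the lookup would become NaN rather than raise.
    out["state"] = out["trap_site"].map(SITE_TO_STATE)
    return out


example = add_state(
    pd.DataFrame({"trap_site": ["Cambarville", "Bulburin State Forest"]})
)
print(example)
```

Keeping the mapping in one dictionary means a misspelled site name shows up as a missing `state` value, which is easy to check for.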

<a href='#contents'>Back to Contents</a>

## Acquire The Data <a name='acquire'></a>

- Columns are renamed to be more user friendly.
- `trap_site` and `state` are added based on Lindenmayer and the DAAG documentation.
- `total_length`, `tail_length`, `chest_girth`, and `belly_girth` are all converted from centimeters to millimeters.
- Three rows in total are dropped because of missing data.  Two of them had no value in `age`, which is a target column.
- `latitude`, `longitude` and `elevation` are added with information from Lindenmayer and DAAG data set on CRAN.
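
The renaming, unit conversion and row-dropping steps listed above could look roughly like the sketch below. It runs on a three-row toy frame, not the real file. The original column names in `RENAMES` are assumed from the Kaggle page, and `prepare` is a hypothetical helper, not a function from `acquire.py`.

```python
import numpy as np
import pandas as pd

# Toy rows in the original units (cm); the second row has a missing age.
raw = pd.DataFrame({
    "case": [1, 2, 3],
    "age": [8.0, np.nan, 6.0],
    "totlngth": [89.0, 91.5, 95.5],
    "taill": [36.0, 36.5, 39.0],
    "chest": [28.0, 28.5, 30.0],
    "belly": [36.0, 33.0, 34.0],
})

# Assumed original names mapped to the names used in the data dictionary.
RENAMES = {
    "totlngth": "total_length",
    "taill": "tail_length",
    "chest": "chest_girth",
    "belly": "belly_girth",
}
CM_COLUMNS = ["total_length", "tail_length", "chest_girth", "belly_girth"]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, convert cm to mm, and drop rows with missing values."""
    out = df.rename(columns=RENAMES)      # returns a new frame
    out[CM_COLUMNS] = out[CM_COLUMNS] * 10  # centimeters to millimeters
    return out.dropna()                   # original index labels are kept


clean = prepare(raw)
print(clean)
```

Because `dropna()` keeps the original index labels, the prepared frame can have fewer rows than its highest index suggests.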

This section uses the `acquire.py` module to make and prepare the dataset after it is downloaded.  It uses the following functions:

- `make_dataset()` : A flow control function that loads the dataset from the `possum.csv` file and calls the module's other functions to prepare the data set.
- `get_dataset()` : Reads `possum.csv` and returns it as a dataframe.  If no file is present, an error is printed informing the user that they need to download the data from Kaggle.
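
To illustrate how the two functions described above might fit together, here is a minimal sketch under stated assumptions. The real `acquire.py` is not shown in this notebook, so the `path` parameter, the `None` return value and the message wording are all hypothetical. Only the behaviour of printing a message when the file is missing comes from the description.

```python
from pathlib import Path

import pandas as pd

KAGGLE_URL = "https://www.kaggle.com/datasets/abrambeyer/openintro-possum"


def get_dataset(path="possum.csv"):
    """Read the CSV into a DataFrame, or print a hint and return None."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Assumed wording; the actual message in acquire.py may differ.
        print(f"{path} not found. Download the data from {KAGGLE_URL}")
        return None
    return pd.read_csv(csv_path)


def make_dataset(path="possum.csv"):
    """Load the raw data, then hand it to the preparation steps."""
    df = get_dataset(path)
    if df is None:
        return None
    # The module's other preparation functions would be called here.
    return df


missing = make_dataset("no_such_file.csv")  # prints the hint, returns None
```

Returning `None` instead of raising keeps the notebook cell from stopping with a traceback, at the cost of requiring the caller to check the result.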

In [2]:
df = acquire.make_dataset()

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101 entries, 0 to 103
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   case          101 non-null    int64  
 1   trap_site     101 non-null    object 
 2   state         101 non-null    object 
 3   sex           101 non-null    object 
 4   age           101 non-null    float64
 5   head_length   101 non-null    float64
 6   skull_width   101 non-null    float64
 7   total_length  101 non-null    float64
 8   tail_length   101 non-null    float64
 9   foot_length   101 non-null    float64
 10  ear_length    101 non-null    float64
 11  eye_width     101 non-null    float64
 12  chest_girth   101 non-null    float64
 13  belly_girth   101 non-null    float64
 14  latitude      101 non-null    float64
 15  longitude     101 non-null    float64
 16  elevation     101 non-null    int64  
dtypes: float64(12), int64(2), object(3)
memory usage: 14.2+ KB


In [4]:
df

| | case | trap_site | state | sex | age | head_length | skull_width | total_length | tail_length | foot_length | ear_length | eye_width | chest_girth | belly_girth | latitude | longitude | elevation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Cambarville | Victoria | m | 8.0 | 94.1 | 60.4 | 890.0 | 360.0 | 74.5 | 54.5 | 15.2 | 280.0 | 360.0 | -37.550000 | 145.883300 | 800 |
| 1 | 2 | Cambarville | Victoria | f | 6.0 | 92.5 | 57.6 | 915.0 | 365.0 | 72.5 | 51.2 | 16.0 | 285.0 | 330.0 | -37.550000 | 145.883300 | 800 |
| 2 | 3 | Cambarville | Victoria | f | 6.0 | 94.0 | 60.0 | 955.0 | 390.0 | 75.4 | 51.9 | 15.5 | 300.0 | 340.0 | -37.550000 | 145.883300 | 800 |
| 3 | 4 | Cambarville | Victoria | f | 6.0 | 93.2 | 57.1 | 920.0 | 380.0 | 76.1 | 52.2 | 15.2 | 280.0 | 340.0 | -37.550000 | 145.883300 | 800 |
| 4 | 5 | Cambarville | Victoria | f | 2.0 | 91.5 | 56.3 | 855.0 | 360.0 | 71.0 | 53.2 | 15.1 | 285.0 | 330.0 | -37.550000 | 145.883300 | 800 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | 100 | Allyn River Forest Park | New South Wales | m | 1.0 | 89.5 | 56.0 | 815.0 | 365.0 | 66.0 | 46.8 | 14.8 | 230.0 | 270.0 | -32.116667 | 151.466667 | 300 |
| 100 | 101 | Allyn River Forest Park | New South Wales | m | 1.0 | 88.6 | 54.7 | 825.0 | 390.0 | 64.4 | 48.0 | 14.0 | 250.0 | 330.0 | -32.116667 | 151.466667 | 300 |
| 101 | 102 | Allyn River Forest Park | New South Wales | f | 6.0 | 92.4 | 55.0 | 890.0 | 380.0 | 63.5 | 45.4 | 13.0 | 250.0 | 300.0 | -32.116667 | 151.466667 | 300 |
| 102 | 103 | Allyn River Forest Park | New South Wales | m | 4.0 | 91.5 | 55.2 | 825.0 | 365.0 | 62.9 | 45.9 | 15.4 | 250.0 | 290.0 | -32.116667 | 151.466667 | 300 |
