# Data Manipulation

### Importing libraries

In [1]:
# Import the Pandas library

import pandas as pd

### Loading the data

In [2]:
# Load the data from an Excel file; specify which sheet by name or by position (0 is the first sheet)

plastics = pd.read_excel("plastic_data.xlsx", sheet_name=1)

# Possible import problems & solutions
# PermissionError --> if the file is open in another program, close it
# FileNotFoundError --> move the file to the same directory as this notebook
# FileNotFoundError --> check the spelling of the file name

In [3]:
# View the data

plastics

#### "Tidy" data

- Rows should contain the things that you are interested in.
- Each column should contain only one type of data.
- Each column should record one observation about each row.
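As a minimal sketch of these guidelines, the made-up table below (the names `wide` and `tidy` and all the values are hypothetical) starts with one measure per row and one country per column, then flips it so that each row is a country and each column holds one kind of value.

```python
import pandas as pd

# Hypothetical toy data: one row per measure, one column per country (not tidy)
wide = pd.DataFrame(
    {"A": [67.0, 40000.0], "B": [31.0, 2200.0]},
    index=["population_m", "gdp_pc"],
)

# Transposing puts one country on each row and one measure in each column
tidy = wide.T

print(tidy)
```

After the flip, every value in a column is the same kind of measurement, which is what later steps such as type conversion and summary statistics rely on.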

In [4]:
# Transpose (flip) the dataset

plastics.T

#### Making changes to data in Pandas

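By default, most Pandas methods return a changed **copy** of your data and leave the original untouched.

To keep the change, you have to explicitly overwrite the original by assigning the result back to it.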
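The difference can be seen on a small, made-up dataframe (`df` and its values are hypothetical): transposing without assignment leaves the shape unchanged, while assigning the result back replaces the original.

```python
import pandas as pd

# Hypothetical toy data with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df.T                          # returns a transposed copy; df itself is untouched
shape_without_assign = df.shape

df = df.T                     # assigning the result back overwrites the original
shape_with_assign = df.shape

print(shape_without_assign, shape_with_assign)
```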

In [5]:
# Transpose the dataset and overwrite the original

plastics = plastics.T

# Overwriting produces no output: the result is stored in the variable rather than displayed.

### Inspecting the data

In [6]:
# Look at the first rows of the dataset (default is 5)

plastics.head()

In [7]:
# Look at the last rows of the dataset (default is 5)

plastics.tail()

In [8]:
# Look at random rows (1 by default)

plastics.sample()

#### Dataframes

A dataframe stores information in rows and columns.

A series is a data structure that holds values from a single row or column.
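A quick way to see the distinction is to pull pieces out of a small, made-up dataframe (the names and values here are hypothetical) and check what type comes back.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"country": ["A", "B"], "gdp_pc": [1000.0, 2500.0]})

column = df["gdp_pc"]     # a single column is a Series
row = df.iloc[0]          # a single row is also a Series
table = df[["gdp_pc"]]    # a list of column names gives back a DataFrame

print(type(column).__name__, type(row).__name__, type(table).__name__)
```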

In [9]:
# Check the type of plastics

type(plastics)

In [10]:
# Check the type of a column

type(plastics[0])

#### Methods and attributes

A dataframe is a kind of data structure, and data structures are **objects**.

An object has:
- things it knows (attributes)
- things it can do (methods)
    
Attributes don't need brackets.

Methods, because they are bound functions, require brackets.

Dataframes are useful because they have so many built-in methods/attributes to speed up data analysis.
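The sketch below, on a made-up dataframe (`df` is hypothetical), contrasts an attribute with a method, and shows what happens if the brackets are left off a method.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

shape = df.shape      # attribute: no brackets, the value is simply looked up
top = df.head(2)      # method: brackets, optionally with arguments inside
unbound = df.head     # no brackets: you get the method itself, not a result

print(shape)
print(callable(unbound))
```

Forgetting the brackets is not an error in itself, which is why it can go unnoticed: the cell just displays a description of the method instead of the data.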

In [11]:
# Check the size/shape of a dataframe with the .shape attribute

plastics.shape

# Rows, columns

### Subsetting a dataframe by index location

In [12]:
# Use .iloc to select rows by position number
# Use square brackets

plastics.iloc[7:9]

# df.iloc[start=0 : end=max]
# end is an exclusive boundary

In [13]:
# Select rows and columns by position number

plastics.iloc[:, 4]

# df.iloc[row_start=0 : row_end=max , col_start=0 : col_end=max]

In [14]:
# Select a specific set of rows and columns by index location

plastics.iloc[[1, 4, 156], [0, 3, 6]]
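The three forms of `.iloc` used above can be compared side by side on a small, made-up dataframe (`df` and its values are hypothetical), where it is easier to check which cells come back.

```python
import pandas as pd

# Hypothetical toy data: 4 rows, 3 columns
df = pd.DataFrame(
    {"a": [10, 11, 12, 13], "b": [20, 21, 22, 23], "c": [30, 31, 32, 33]}
)

rows = df.iloc[1:3]               # rows at positions 1 and 2 (3 is excluded)
column = df.iloc[:, 2]            # every row, column at position 2
block = df.iloc[[0, 3], [0, 2]]   # rows 0 and 3, columns 0 and 2

print(rows)
print(column)
print(block)
```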

### Fixing the column headings

In [15]:
# Access the current columns with an attribute

plastics.columns

In [16]:
# Use .iloc to get all of the values in row 0

new_cols = plastics.iloc[0]

In [17]:
# Overwrite the columns with the new columns

plastics.columns = new_cols

In [18]:
# Use .iloc to select all the rows except row 0
# Overwrite the original dataframe, removing row 0

plastics = plastics.iloc[1:]

In [19]:
plastics.head(3)
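The same two steps, shown on a small, made-up dataframe (`raw`, `clean` and the values are hypothetical) whose real headings are stuck in row 0:

```python
import pandas as pd

# Hypothetical frame whose real headings are stuck in row 0
raw = pd.DataFrame([["country", "year"], ["A", 2010], ["B", 2010]])

raw.columns = raw.iloc[0]   # use the values in row 0 as the column labels
clean = raw.iloc[1:]        # keep every row except row 0

print(clean)
```

Because the headings were originally stored alongside the data, the columns are likely to hold mixed or text values after this step, so their types usually still need converting.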

### Fixing the row index 

In [20]:
# Reset the index, moving the current row labels into a new column
# and adding a numeric index in their place

plastics.reset_index(inplace=True)

# inplace=True edits the original, not a copy
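A minimal sketch of what resetting does, using a made-up dataframe (`df` is hypothetical) whose row labels are country names: the labels move into a column called `index`, and the rows are renumbered from 0.

```python
import pandas as pd

# Hypothetical toy data with country names as row labels
df = pd.DataFrame({"gdp_pc": [2200.0, 6100.0]}, index=["A", "B"])

df.reset_index(inplace=True)   # old labels become a column named "index"

print(df)
```

Writing `df = df.reset_index()` would have the same effect as `inplace=True` here.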

In [21]:
plastics.tail()

#### Guidelines for column headings

- short
- precise
- contain no spaces
- relevant
- lowercase
- clearly distinct from each other
- no more detailed than they need to be for that dataset

In [22]:
# Rename columns
# A dict of old:new column names

plastics.rename(columns={"index": "country",
                         "PCapita plastic waste (kg per person per day)": "plastic_pc",
                         "otalpopulatio": "total_pop",
                         "Mismanaged plastic waste (tonnes)" : "plastic_mm"},
                inplace=True)
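One thing to watch with `.rename`: by default, an old name that doesn't match any column is silently skipped. The sketch below uses a made-up dataframe and a deliberately misspelt key (both hypothetical) to show this, and the `errors="raise"` option that reports the mismatch instead.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"index": ["A"], "GDP per capita": [2200.0]})

# The misspelt key "GDP per capta" is silently ignored by default
renamed = df.rename(columns={"index": "country", "GDP per capta": "gdp_pc"})

# errors="raise" turns the same mistake into a KeyError
try:
    df.rename(columns={"GDP per capta": "gdp_pc"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```

Checking the headings after renaming is a simple way to catch this.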

In [23]:
plastics.head()

### Selecting columns 

In [24]:
# Select a column by name

plastics["gdp_pc"]

# Select a set of columns by name

plastics[["gdp_pc", "total_pop", "country"]]

# The columns come back in the order you list them, which doesn't have to match their order in the dataframe

### Cleaning the data

In [25]:
# Remove a character from a column
# Treat the column as a set of strings
# In each string, replace $ with nothing
# regex=False treats $ as a literal character rather than a regular expression
# Overwrite the original column

plastics["gdp_pc"] = plastics["gdp_pc"].str.replace("$", "", regex=False)
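The `$` character needs some care, because it has a special meaning ("end of string") in regular expressions, and Pandas versions before 2.0 treated the pattern as a regular expression by default. The sketch below uses made-up values (`prices` is hypothetical) to show both interpretations.

```python
import pandas as pd

# Hypothetical values containing a currency symbol
prices = pd.Series(["$1,200.50", "$980.00"])

# regex=False treats "$" as a literal character on every Pandas version
literal = prices.str.replace("$", "", regex=False)

# As a regular expression, "$" means "end of string", so nothing is removed
as_regex = prices.str.replace("$", "", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

Passing `regex=False` explicitly makes the result the same regardless of which version is installed.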

In [26]:
plastics.head()

### Dealing with missing values 

In [27]:
# Find all the NA values

plastics.isna().sum()

#### Strategies for NaN values

Missing values have to be dealt with as part of cleaning, but there's no easy solution: this is a matter of judgement, rather than something with a clear best answer. 

There are two main ways to deal with missing data:

1. Remove it: get rid of rows with missing values. This ensures data quality and consistency, but means that any future analysis will be performed on a smaller and less generalisable dataset.

2. Assume it: provide a filler value for any gaps, such as 0, or the average of the relevant column. This ensures that you keep as much data as possible, but means that your data is less reliable. Importantly, this should only ever be done when you can make **reasonable** assumptions about a default value.

Either strategy can be appropriate, depending on the situation. What's important is that you:

1. Think carefully about the decision and its implications
2. Explain in any report what decisions you made (and why)
3. Acknowledge the impact of your decisions on your results
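The two strategies can be compared on a small, made-up dataframe (`df` and its values are hypothetical). Using the column mean as the filler is only one possible assumption, not a recommendation.

```python
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame(
    {"country": ["A", "B", "C"], "gdp_pc": [1000.0, float("nan"), 3000.0]}
)

# Strategy 1: remove rows that have any missing value
removed = df.dropna()

# Strategy 2: assume a value, here the mean of the column (NaN is skipped)
filled = df.fillna({"gdp_pc": df["gdp_pc"].mean()})

print(len(removed), filled["gdp_pc"].tolist())
```

Neither call changes `df` itself, so both results can be inspected before deciding which one to keep.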

In [28]:
# Deal with the NA values

# You can fill all the gaps in a column with a value (shown below, commented out, with 0 as the filler), but you should be careful with this.

# plastics["gdp_pc"] = plastics["gdp_pc"].fillna(0)

# Drop all the rows with any missing values

plastics.dropna(inplace=True)

### Changing data types 

In [29]:
# Check the types

plastics.dtypes

In [30]:
# Convert a column to numeric by letting Pandas make assumptions

plastics["plastic_pc"] = pd.to_numeric(plastics["plastic_pc"])

In [31]:
# Convert a column to a specific type

plastics["gdp_pc"] = plastics["gdp_pc"].astype(float)
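Both conversions fail if a column still contains text that can't be read as a number. As a hedged sketch on made-up values (`raw` is hypothetical), `pd.to_numeric` can be asked to turn such values into NaN with `errors="coerce"`, whereas `.astype(float)` raises an error.

```python
import pandas as pd

# Hypothetical messy column: two numbers stored as text, one unreadable value
raw = pd.Series(["1.5", "2", "n/a"])

# errors="coerce" replaces anything that can't be parsed with NaN
numeric = pd.to_numeric(raw, errors="coerce")

# astype(float) stops with a ValueError on the unreadable value
try:
    raw.astype(float)
    strict_failed = False
except ValueError:
    strict_failed = True

print(numeric.tolist(), strict_failed)
```

Coercing creates new missing values, so it is worth counting NaNs again afterwards.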

In [32]:
# Convert all columns that should be numeric to be numeric.

plastics[["total_pop", "year",
          "plastic_mm", "coastal_pop"]] = plastics[["total_pop", "year",
                                                    "plastic_mm", "coastal_pop"]].astype(int)

In [33]:
# Check the types once more

plastics.dtypes

### High-level overview methods

In [34]:
# Now that we have numeric columns, we can generate summary statistics

plastics.describe()

In [35]:
# See a range of information about the dataframe, including the count of each column type

plastics.info()

### Column methods

In [36]:
# Round a column to 2 d.p. and overwrite the original

plastics["gdp_pc"] = plastics["gdp_pc"].round(2)

In [37]:
# Get the number of unique values in a column (good for categories)

plastics["year"].nunique()

In [38]:
# Get the unique values in a column (returned as an array)

plastics["year"].unique()

In [39]:
# Count how many times each unique value appears in a column

plastics["year"].value_counts()

In [40]:
# Get the average of a column

plastics["gdp_pc"].mean()

# max, min, std, count, sum and various others are also available

### Sorting a dataframe

In [41]:
# Sort a dataframe based on a column, overwriting the original
# By default the sort is ascending; ascending=False sorts from largest to smallest

plastics.sort_values(by="plastic_mm", ascending=False, inplace=True)
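The effect of `ascending` is easier to see on a small, made-up dataframe (`df` and its values are hypothetical). Note that each row keeps its original index label when it moves.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"country": ["A", "B", "C"], "plastic_mm": [50, 900, 300]})

smallest_first = df.sort_values(by="plastic_mm")                   # default
largest_first = df.sort_values(by="plastic_mm", ascending=False)

print(smallest_first["country"].tolist())
print(largest_first["country"].tolist())
print(largest_first.index.tolist())
```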

In [42]:
plastics.head()

### Filtering a dataframe

In [43]:
# Create a true/false filter based on a condition

bool_filter = plastics["gdp_pc"] < 1000

In [44]:
# Return every row where the filter says True

plastics[bool_filter]

In [45]:
# One-line format

plastics[plastics["gdp_pc"] < 1000]

In [46]:
# Filter for all the rows where total_pop is less than 40000

plastics[plastics["total_pop"] < 40000]

# Filter for all the rows where the total_pop is greater than the average

plastics[plastics["total_pop"] > plastics["total_pop"].mean()]

# Filter for all the rows where the total_pop is less than the coastal_pop

plastics[plastics["total_pop"] < plastics["coastal_pop"]]

In [47]:
# Filter for multiple conditions: wrap each condition in brackets and join them with & (for and) or | (for or)

plastics[(plastics["gdp_pc"] > 1000) & (plastics["total_pop"] < 50000)]
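Storing each condition as its own true/false filter can make combinations easier to read. The sketch below uses a made-up dataframe (the names `rich` and `small` and all the values are hypothetical) and also shows `~`, which flips a filter.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "gdp_pc": [500.0, 1500.0, 3000.0],
        "total_pop": [20000, 30000, 80000],
    }
)

rich = df["gdp_pc"] > 1000
small = df["total_pop"] < 50000

both = df[rich & small]          # rows where both conditions are True
either = df[rich | small]        # rows where at least one is True
neither = df[~(rich | small)]    # ~ flips True and False

print(both["country"].tolist())
print(either["country"].tolist())
print(neither["country"].tolist())
```

The brackets around each condition matter in the one-line form because `&` and `|` are evaluated before comparisons such as `>` and `<`.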

### Writing clean data to a file 

In [48]:
# Write your clean dataset to a file, leaving out the row index

plastics.to_csv("clean_plastics.csv", index=False)
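To check what ends up in the file, the sketch below writes a small, made-up dataframe (`df` is hypothetical) to an in-memory text buffer instead of a real file, then reads it back.

```python
import io

import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"country": ["A", "B"], "gdp_pc": [2200.0, 6100.0]})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back in to confirm the round trip
reloaded = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```

With `index=False` the first line of the output holds only the column headings; without it, the row numbers would be written as an extra, unnamed first column.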