# Data Wrangling

In this notebook we will go over the basics of data wrangling. Before training a machine learning model, it's important to inspect the data you have and make sure it is clean and ready to train on. This can include dropping NA values, removing or averaging duplicate rows, or renaming columns to clarify what they mean.
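
As a rough sketch of those three cleaning steps, the snippet below works on a tiny made-up table. The values and the short column names `mat` and `rho` are illustrative assumptions, not data used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated material and one missing density
raw_df = pd.DataFrame({
    'mat': ['Steel', 'Steel', 'Copper', 'Nickel'],
    'rho': [7.80, 7.90, 8.96, np.nan],
})

# 1. Drop rows that contain NA values
cleaned_df = raw_df.dropna()

# 2. Average duplicate rows that describe the same material
cleaned_df = cleaned_df.groupby('mat', as_index=False).mean()

# 3. Rename columns so their meaning (and units) is clear
cleaned_df = cleaned_df.rename(columns={'mat': 'Material', 'rho': 'Density (g/cm³)'})
cleaned_df
```

Whether duplicates should be averaged or simply dropped (`drop_duplicates`) depends on whether the repeated rows are independent measurements or accidental copies.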

#### Video

https://www.youtube.com/watch?v=5fMr4mYuCXI&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=19 (Best Practices in Materials Informatics)

## Setup

We'll start by creating a pandas DataFrame and populating it with some made-up materials data.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with materials properties
materials_data = {'Material': ['Steel', 'Aluminum', 'Copper', 'Titanium', 'Polymer'],
                  'Density (g/cm³)': [7.85, 2.70, 8.96, 4.51, 1.20],
                  'Young\'s Modulus (GPa)': [200, 70, 110, 120, 3],
                  'Melting Point (°C)': [1370, 660, 1085, 1668, 150]}

materials_df = pd.DataFrame(materials_data)

# Display the DataFrame
materials_df


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C)
0,Steel,7.85,200,1370
1,Aluminum,2.7,70,660
2,Copper,8.96,110,1085
3,Titanium,4.51,120,1668
4,Polymer,1.2,3,150


## Visualizing Materials Data

You can preview the first few rows of the materials properties DataFrame with the `head` method, which returns the first five rows by default.


In [2]:
# Visualize the first few rows of the DataFrame
materials_df.head()


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C)
0,Steel,7.85,200,1370
1,Aluminum,2.7,70,660
2,Copper,8.96,110,1085
3,Titanium,4.51,120,1668
4,Polymer,1.2,3,150
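
Beyond the default preview, a few related calls are often useful for a first look at a table. The sketch below rebuilds a two-column version of the data so that it runs on its own; the variable names are only illustrative.

```python
import pandas as pd

# Stand-alone copy of two columns from the materials table
preview_df = pd.DataFrame({
    'Material': ['Steel', 'Aluminum', 'Copper', 'Titanium', 'Polymer'],
    'Density (g/cm³)': [7.85, 2.70, 8.96, 4.51, 1.20],
})

first_two = preview_df.head(2)   # first two rows instead of the default five
last_two = preview_df.tail(2)    # last two rows
summary = preview_df.describe()  # count, mean, std, min, quartiles, max of numeric columns
summary
```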


## Selecting Materials Properties

To select specific materials properties, pass a list of column names inside the indexing brackets, which gives the double square brackets shown below.

In [3]:
# Selecting specific properties
density_melting_point = materials_df[['Material', 'Density (g/cm³)', 'Melting Point (°C)']]
density_melting_point


Unnamed: 0,Material,Density (g/cm³),Melting Point (°C)
0,Steel,7.85,1370
1,Aluminum,2.7,660
2,Copper,8.96,1085
3,Titanium,4.51,1668
4,Polymer,1.2,150
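
The difference between single and double brackets is easy to miss, so here is a minimal sketch of it. The small DataFrame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

select_df = pd.DataFrame({
    'Material': ['Steel', 'Aluminum', 'Copper'],
    'Density (g/cm³)': [7.85, 2.70, 8.96],
})

# Single brackets with one column name return a one-dimensional Series
density_series = select_df['Density (g/cm³)']

# Double brackets (a list of names) return a DataFrame, even for one column
density_frame = select_df[['Density (g/cm³)']]

print(type(density_series).__name__, type(density_frame).__name__)
```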


## Modifying Materials Data

You can change an individual value using the `.at` accessor, which takes a row label and a column name.

In [4]:
# Changing individual properties
materials_df.at[4, 'Density (g/cm³)'] = 1.25
materials_df


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C)
0,Steel,7.85,200,1370
1,Aluminum,2.7,70,660
2,Copper,8.96,110,1085
3,Titanium,4.51,120,1668
4,Polymer,1.25,3,150
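
`.at` requires knowing the row label (here `4`). When you only know the material's name, one alternative is `.loc` with a boolean mask. This is a sketch on a stand-alone copy of the data, not a replacement for the cell above.

```python
import pandas as pd

edit_df = pd.DataFrame({
    'Material': ['Steel', 'Aluminum', 'Polymer'],
    'Density (g/cm³)': [7.85, 2.70, 1.20],
})

# Select the row(s) where Material is 'Polymer' and assign a new density
is_polymer = edit_df['Material'] == 'Polymer'
edit_df.loc[is_polymer, 'Density (g/cm³)'] = 1.25
edit_df
```

Unlike `.at`, this updates every row that matches the mask, which matters if a material appears more than once.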


## Merging Materials DataFrames

You can merge two materials properties DataFrames on a common column with `pd.merge`. Rows are matched by the values in that column, here `Material`.

In [5]:
# Merging with additional properties
thermal_conductivity = {'Material': ['Steel', 'Aluminum', 'Copper', 'Titanium', 'Polymer'],
                        'Thermal Conductivity (W/mK)': [50, 205, 398, 21, 0.2]}

thermal_conductivity_df = pd.DataFrame(thermal_conductivity)

merged_materials_df = pd.merge(materials_df, thermal_conductivity_df, on='Material')
merged_materials_df


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C),Thermal Conductivity (W/mK)
0,Steel,7.85,200,1370,50.0
1,Aluminum,2.7,70,660,205.0
2,Copper,8.96,110,1085,398.0
3,Titanium,4.51,120,1668,21.0
4,Polymer,1.25,3,150,0.2
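
In the example above both tables list exactly the same materials, so every row finds a match. When they do not, the `how` argument controls which rows survive. The sketch below uses two deliberately mismatched, stand-alone tables to show the difference.

```python
import pandas as pd

left_df = pd.DataFrame({
    'Material': ['Steel', 'Aluminum', 'Copper'],
    'Density (g/cm³)': [7.85, 2.70, 8.96],
})
right_df = pd.DataFrame({
    'Material': ['Steel', 'Copper', 'Titanium'],
    'Thermal Conductivity (W/mK)': [50, 398, 21],
})

# Default how='inner': keep only materials present in both tables
inner_df = pd.merge(left_df, right_df, on='Material')

# how='outer': keep every material; unmatched values become NaN.
# indicator=True adds a '_merge' column saying where each row came from.
outer_df = pd.merge(left_df, right_df, on='Material', how='outer', indicator=True)
outer_df
```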


## Concatenating Materials DataFrames

You can stack materials properties DataFrames vertically using the `pd.concat` function. Columns that are missing from one of the DataFrames are filled with NaN for its rows.

In [6]:
# Concatenating with new materials
new_materials = {'Material': ['Graphene', 'Ceramic'],
                 'Density (g/cm³)': [2.26, 3.2],
                 'Young\'s Modulus (GPa)': [1100, 300],
                 'Melting Point (°C)': [4000, 2000]}

new_materials_df = pd.DataFrame(new_materials)

concatenated_materials_df = pd.concat([merged_materials_df, new_materials_df], ignore_index=True)
concatenated_materials_df


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C),Thermal Conductivity (W/mK)
0,Steel,7.85,200,1370,50.0
1,Aluminum,2.7,70,660,205.0
2,Copper,8.96,110,1085,398.0
3,Titanium,4.51,120,1668,21.0
4,Polymer,1.25,3,150,0.2
5,Graphene,2.26,1100,4000,
6,Ceramic,3.2,300,2000,
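
The new rows have no thermal conductivity, so that column is empty (NaN) for them. A minimal sketch of how such gaps could be counted and handled, using a stand-alone three-row table:

```python
import numpy as np
import pandas as pd

gaps_df = pd.DataFrame({
    'Material': ['Steel', 'Graphene', 'Ceramic'],
    'Thermal Conductivity (W/mK)': [50.0, np.nan, np.nan],
})

# Count missing values per column
missing_counts = gaps_df.isna().sum()

# Keep only rows that have a thermal conductivity value
complete_rows = gaps_df.dropna(subset=['Thermal Conductivity (W/mK)'])
missing_counts
```

Dropping rows is only one option; whether to drop, fill, or look up the missing values depends on what the data will be used for.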


## Selecting Materials Based on Criteria

You can filter the materials properties DataFrame with a boolean condition: the comparison produces a True/False value for each row, and only the rows marked True are kept.

In [7]:
# Selecting materials based on criteria
high_modulus_materials = materials_df[materials_df['Young\'s Modulus (GPa)'] > 100]
high_modulus_materials


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C)
0,Steel,7.85,200,1370
2,Copper,8.96,110,1085
3,Titanium,4.51,120,1668
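
Several criteria can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses. The sketch below rebuilds the relevant columns so it runs on its own; the density threshold of 8 g/cm³ is an arbitrary choice for illustration.

```python
import pandas as pd

filter_df = pd.DataFrame({
    'Material': ['Steel', 'Aluminum', 'Copper', 'Titanium', 'Polymer'],
    'Density (g/cm³)': [7.85, 2.70, 8.96, 4.51, 1.25],
    "Young's Modulus (GPa)": [200, 70, 110, 120, 3],
})

# Stiff (modulus above 100 GPa) AND relatively light (density below 8 g/cm³)
mask = (filter_df["Young's Modulus (GPa)"] > 100) & (filter_df['Density (g/cm³)'] < 8)
stiff_and_light = filter_df[mask]
stiff_and_light
```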


Remember, if you can't recall the syntax for something, you can ask GitHub Copilot (Ctrl+I in a new code block) or ChatGPT to help with basic, and sometimes not-so-basic, tasks.


In [8]:
# Sorting data by decreasing density, then by increasing melting point to break ties
sorted_data = concatenated_materials_df.sort_values(['Density (g/cm³)', 'Melting Point (°C)'], ascending=[False, True])
sorted_data


Unnamed: 0,Material,Density (g/cm³),Young's Modulus (GPa),Melting Point (°C),Thermal Conductivity (W/mK)
2,Copper,8.96,110,1085,398.0
0,Steel,7.85,200,1370,50.0
3,Titanium,4.51,120,1668,21.0
6,Ceramic,3.2,300,2000,
1,Aluminum,2.7,70,660,205.0
5,Graphene,2.26,1100,4000,
4,Polymer,1.25,3,150,0.2
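
`sort_values` only reorders rows. To actually group rows and summarize each group, pandas provides `groupby`. The sketch below adds a hypothetical `Class` column (an assumption for illustration, not part of the data above) and computes the mean density per class.

```python
import pandas as pd

group_df = pd.DataFrame({
    'Material': ['Steel', 'Aluminum', 'Copper', 'Titanium', 'Polymer'],
    'Density (g/cm³)': [7.85, 2.70, 8.96, 4.51, 1.25],
    # Hypothetical labels added only for this example
    'Class': ['Metal', 'Metal', 'Metal', 'Metal', 'Non-metal'],
})

# Group rows by class, then average the density within each group
mean_density = group_df.groupby('Class')['Density (g/cm³)'].mean()
mean_density
```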
