# Exploring Astronaut Activities in SQL
As any data analyst knows, data is not always pre-processed or easy to visualize. Extracting information and restructuring it is essential to analytics. For that reason, this project focuses on using string cleaning to prepare and process a dataset of [astronauts' extravehicular activities (EVAs)](https://www.kaggle.com/datasets/alexandrepetit881234/astronauts-extra-vehicular-activities).


## Exploring our data
Let's start by looking at the table we will be working with.

In [36]:
SELECT *
FROM evas 
ORDER BY duration DESC;

Unnamed: 0,date,country,vehicle,duration,crew,purpose,year,program
0,2001-03-10,USA,STS-102/5A.1,536,"Jim Voss, Susan Helms",Disconnected PMA3 from Node1 electrical cables...,2001,Space Shuttle
1,1992-05-13,USA,STS-49,509,Thuot/Hieb/Akers,3 man EVA. Manually capture/repair INTELSAT,1992,Space Shuttle
2,2012-08-30,USA,ISS Incr-32,497,"Sunita Williams, Akihiko Hoshide",ISS based EVA. Installed 1 of 2 power cables ...,2012,ISS
3,1999-12-22,USA,STS-103/-3A,495,"Steve Smith, John Grunsfeld",HST servicing (RSU gyros and volt/temp improve...,1999,Space Shuttle
4,1999-12-23,USA,STS-103/-3A,490,"Mike Foale, Claude Nicollier",HST servicing (486 computer and fine guidance ...,1999,Space Shuttle
...,...,...,...,...,...,...,...,...
370,1973-05-25,USA,Skylab 2,0,"Paul Weitz, Joe Kerwin, Pete Conrad","After normal docking failed, all donned suits,...",1973,Skylab
371,1973-09-28,Russia,Soyuz 12,0,"Victor Lazerov, Oleg Makarov",New Orlan D suits checked out inside cabin,1973,Soyuz
372,1982-11-14,USA,STS-5,0,"Bill Lenoir, Joe Allen",Suit fan and O2 regulator failures prevented f...,1982,Space Shuttle
373,1998-03-03,Russia,Soyuz TM-27,0,"Talgat Musabeyev, Nikola Budarin",Manual wrenches inadequate to release hatch bo...,1998,Mir


Let's inspect the `purpose` column in greater detail.

In [37]:
SELECT purpose
FROM evas;

Unnamed: 0,purpose
0,First U.S. EVA. Used HHMU and took photos. G...
1,HHMU EVA cancelled before starting by stuck on...
2,"Inadequate restraints, stiff 25ft umbilical an..."
3,Standup EVA. UV photos of stars. Ended by ey...
4,Retrieved MMOD experiment from docked Agena. ...
...,...
370,1 hr late start due to airlock valve. Relocat...
371,"Installed plasma experiment/cables/probes, rep..."
372,Power cable clamps installed and Kurs tested i...
373,"Installed VINOSLIVOST experiment on MRM2, 2..."


## What are the most common types of EVAs?
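Let's get a rough idea of the most common types of EVAs by flagging keywords in the `purpose` column with [`CASE` expressions](https://www.postgresql.org/docs/current/functions-conditional.html). Each expression returns 1 when one of its keywords appears in the text and 0 otherwise, so a single EVA can fall into several categories.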

In [38]:
SELECT purpose,
CASE WHEN purpose ILIKE '%photos%' THEN 1 ELSE 0 END AS photography, 
CASE WHEN purpose ILIKE '%collect%' THEN 1 ELSE 0 END AS collection,
CASE WHEN purpose ILIKE '%construct%' OR purpose ILIKE '%install%' OR purpose ILIKE '%assembl%' THEN 1 ELSE 0 END AS installation,
CASE WHEN purpose ILIKE '%replace%' OR purpose ILIKE '%fix%' OR purpose ILIKE '%repair%' OR purpose ILIKE '%servic%' THEN 1 ELSE 0 END AS repair
FROM evas;

Unnamed: 0,purpose,photography,collection,installation,repair
0,First U.S. EVA. Used HHMU and took photos. G...,1,0,0,0
1,HHMU EVA cancelled before starting by stuck on...,0,0,0,0
2,"Inadequate restraints, stiff 25ft umbilical an...",0,0,0,0
3,Standup EVA. UV photos of stars. Ended by ey...,1,0,0,0
4,Retrieved MMOD experiment from docked Agena. ...,0,0,0,0
...,...,...,...,...,...
370,1 hr late start due to airlock valve. Relocat...,0,0,1,1
371,"Installed plasma experiment/cables/probes, rep...",0,0,1,1
372,Power cable clamps installed and Kurs tested i...,0,0,1,1
373,"Installed VINOSLIVOST experiment on MRM2, 2...",0,0,1,0
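
As a side note on why the flags use `ILIKE` rather than `LIKE`, here is a minimal sketch. The three strings are shortened, made-up examples modeled on the `purpose` column (not rows copied from `evas`), and `sample` is a hypothetical inline table.

```sql
-- LIKE is case-sensitive, ILIKE is not (ILIKE is a PostgreSQL extension).
SELECT purpose,
       purpose LIKE  '%install%' AS like_match,   -- misses the capitalised "Installed"
       purpose ILIKE '%install%' AS ilike_match   -- matches both spellings
FROM (
    VALUES ('Installed plasma experiment'),
           ('Power cable clamps installed'),
           ('UV photos of stars')
) AS sample(purpose);
```

Because the keyword sits between two `%` wildcards, it also matches word stems, so `'%install%'` covers "installed", "installing", and "installation".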


We are now ready to build this into a final query!

In [49]:
WITH purposes AS (
    SELECT purpose,
           CASE WHEN purpose ILIKE '%photos%' THEN 1 ELSE 0 END AS photography,
           CASE WHEN purpose ILIKE '%collect%' THEN 1 ELSE 0 END AS collection,
           CASE WHEN purpose ILIKE '%construct%' OR purpose ILIKE '%install%' OR purpose ILIKE '%assembl%' THEN 1 ELSE 0 END AS installation,
           CASE WHEN purpose ILIKE '%replace%' OR purpose ILIKE '%fix%' OR purpose ILIKE '%repair%' OR purpose ILIKE '%servic%' THEN 1 ELSE 0 END AS repair
    FROM evas
)

-- One row per category; UNION ALL stacks them without a de-duplication step
SELECT SUM(photography) AS count,
       'photography' AS type
FROM purposes
UNION ALL
SELECT SUM(collection) AS count,
       'collection' AS type
FROM purposes
UNION ALL
SELECT SUM(installation) AS count,
       'installation' AS type
FROM purposes
UNION ALL
SELECT SUM(repair) AS count,
       'repair' AS type
FROM purposes
ORDER BY count DESC;

Unnamed: 0,count,type
0,191,installation
1,129,repair
2,16,collection
3,13,photography
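
For comparison, the same kind of keyword count can be written in one pass with PostgreSQL's aggregate `FILTER` clause. This is an alternative sketch, not the query used above: it returns one wide row instead of one row per category, and it runs on a hypothetical inline `sample` table of made-up strings rather than on `evas`.

```sql
WITH sample(purpose) AS (
    VALUES ('Installed plasma experiment and took photos'),
           ('Replaced failed pump module'),
           ('Collected samples and repaired antenna')
)
SELECT COUNT(*) FILTER (WHERE purpose ILIKE '%photos%')  AS photography,
       COUNT(*) FILTER (WHERE purpose ILIKE '%collect%') AS collection,
       COUNT(*) FILTER (WHERE purpose ILIKE '%install%') AS installation,
       COUNT(*) FILTER (WHERE purpose ILIKE '%replace%'
                           OR purpose ILIKE '%repair%')  AS repair
FROM sample;
```

The `UNION` version is easier to sort and chart because each category is its own row. The `FILTER` version scans the table once and avoids repeating the `SELECT`.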


## How much material has been extracted from EVAs?
Skimming through the `purpose` column, we also saw numerous references to collecting rock/dust or geologic material. Because these quantities are buried in free text, it is difficult to total them across rows directly. Regular expressions to the rescue!

We will define a pattern to extract the pounds collected per EVA, and then sum them up. Let's first do a sense check of the data.

In [40]:
SELECT purpose
FROM evas 
WHERE purpose ILIKE '%geologic%' OR purpose ILIKE '%rock%';

Unnamed: 0,purpose
0,First to walk on the moon. Some trouble getti...
1,Collected 75.6 lb of geologic material. ALSEP...
2,Collected 94.4 lb of geologic material. ALSEP...
3,Collected 169 lb of geologic material. ALSEP ...
4,Collected 208 lb of rock/dust (41lb this day)....
5,Collected 82 lb of rock/dust. Drove rover 11.5 km
6,Collected 90 lb of rock/dust. Drove rover 27....
7,Collected 243 lb of geologic material. ALSEP ...


Okay, we now know that the pounds collected always appear in the form `<number> lb of rock/dust` or `<number> lb of geologic material`. We can construct a pattern to detect this and extract the number!

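To do so, we will make use of a regular expression built from a few pieces: `\d+` matches one or more digits, `\.?` matches an optional decimal point, `\d*` matches any digits after it, and an alternation of `rock` and `geologic` matches either word.

Let's put this into action, using [`SUBSTRING()`](https://www.postgresql.org/docs/9.1/functions-string.html) to extract our pattern!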

In [41]:
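-- The parentheses around the number make SUBSTRING() return only that part;
-- (?:...) groups the alternation without capturing it
SELECT purpose,
       SUBSTRING(purpose, '(\d+\.?\d*) lb of (?:rock|geologic)') AS weight
FROM evas;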

Unnamed: 0,purpose,weight
0,First U.S. EVA. Used HHMU and took photos. G...,
1,HHMU EVA cancelled before starting by stuck on...,
2,"Inadequate restraints, stiff 25ft umbilical an...",
3,Standup EVA. UV photos of stars. Ended by ey...,
4,Retrieved MMOD experiment from docked Agena. ...,
...,...,...
370,1 hr late start due to airlock valve. Relocat...,
371,"Installed plasma experiment/cables/probes, rep...",
372,Power cable clamps installed and Kurs tested i...,
373,"Installed VINOSLIVOST experiment on MRM2, 2...",
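
Most rows come back empty because they never mention a weight. To see what `SUBSTRING()` returns when the pattern does match, here is a small sketch on three strings modeled on the `purpose` values shown earlier (the inline `sample` table is hypothetical). It relies on a documented PostgreSQL behavior: when the pattern contains parentheses, `SUBSTRING()` returns the text matched by the first parenthesized group rather than the whole match.

```sql
SELECT purpose,
       -- no capturing group: the whole match is returned
       SUBSTRING(purpose FROM '\d+\.?\d* lb of (?:rock|geologic)')   AS whole_match,
       -- capturing group around the number: only the number is returned
       SUBSTRING(purpose FROM '(\d+\.?\d*) lb of (?:rock|geologic)') AS number_only
FROM (
    VALUES ('Collected 75.6 lb of geologic material.'),
           ('Collected 82 lb of rock/dust. Drove rover 11.5 km'),
           ('Standup EVA. UV photos of stars.')
) AS sample(purpose);
```

Rows that do not match the pattern return `NULL` in both columns, which `SUM()` later ignores.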


Now we can use a CTE to calculate the total amount!

In [42]:
WITH weights AS (
    SELECT
        purpose,
        -- Capture the number only; (?:rock|geologic) matches either whole word
        SUBSTRING(purpose, '(\d+\.?\d*)\slb\sof\s(?:rock|geologic)')::NUMERIC AS weight
    FROM evas
    WHERE purpose ILIKE '%rock%' OR purpose ILIKE '%geologic%'
)

SELECT SUM(weight)
FROM weights;

Unnamed: 0,sum
0,1008.3
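
One regex detail is worth spelling out when matching "rock or geologic": square brackets define a character class (any single one of the listed characters), whereas a group with `|` matches whole alternatives. The sketch below compares the two on a few hypothetical words using PostgreSQL's `~` regex-match operator.

```sql
SELECT word,
       word ~ '^[rock|geologic]'   AS char_class,   -- true if the first letter is one of r,o,c,k,|,g,e,l,i
       word ~ '^(?:rock|geologic)' AS alternation   -- true only if the word starts with "rock" or "geologic"
FROM (
    VALUES ('rock'), ('geologic'), ('lunar'), ('sand')
) AS sample(word);
```

Here `lunar` satisfies the character class (it starts with `l`) but not the alternation, which is why the grouped form is the safer way to express "either word".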


## Which astronauts have the most time in EVAs?
We also have information on how much time each EVA took, as well as the astronauts who participated. Let's use this information to try and calculate totals for each astronaut!

First, let's find the maximum number of astronauts in a single EVA by [splitting](https://www.postgresql.org/docs/9.1/functions-string.html) the `crew` column on commas with `SPLIT_PART()`. If no row has a fourth name, we only need to handle three. We can also use `TRIM()` to remove any extra whitespace from the resulting names.

In [43]:
SELECT crew,
       SPLIT_PART(crew, ',', 4) AS fourth_astronaut
FROM evas 
WHERE SPLIT_PART(crew, ',', 4) != '';

Unnamed: 0,crew,fourth_astronaut
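
To make the behavior of `SPLIT_PART()` concrete, here is a short sketch on a single literal string (the crew of the longest EVA in the first table). It shows that the second field keeps the space that follows the comma unless it is trimmed, and that asking for a field that does not exist returns an empty string rather than `NULL`.

```sql
SELECT SPLIT_PART('Jim Voss, Susan Helms', ',', 1)       AS first_part,      -- 'Jim Voss'
       SPLIT_PART('Jim Voss, Susan Helms', ',', 2)       AS second_raw,      -- ' Susan Helms' (leading space)
       TRIM(SPLIT_PART('Jim Voss, Susan Helms', ',', 2)) AS second_trimmed,  -- 'Susan Helms'
       SPLIT_PART('Jim Voss, Susan Helms', ',', 4) = ''  AS fourth_is_empty; -- true
```

This is why the check above compares the fourth field with `''`.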


The query returns no rows, so no EVA lists more than three comma-separated crew members. Let's create a CTE that splits `crew` into up to three astronaut columns, which we can then use to piece together `duration` information for each astronaut.

In [44]:
WITH astronauts_split AS (
    SELECT crew,
           SPLIT_PART(crew, ',', 1) AS first_astronaut,
           SPLIT_PART(crew, ',', 2) AS second_astronaut,
           SPLIT_PART(crew, ',', 3) AS third_astronaut,
           duration
    FROM evas
)

SELECT *
FROM astronauts_split;

Unnamed: 0,crew,first_astronaut,second_astronaut,third_astronaut,duration
0,Ed White,Ed White,,,36
1,David Scott,David Scott,,,0
2,Eugene Cernan,Eugene Cernan,,,127
3,Mike Collins,Mike Collins,,,50
4,Mike Collins,Mike Collins,,,39
...,...,...,...,...,...
370,"Gennady Padalka, Yuri Malenchenko",Gennady Padalka,Yuri Malenchenko,,351
371,"Pavel Vinogradov, Roman Romanenko",Pavel Vinogradov,Roman Romanenko,,398
372,"Fyodor Yurchikhin, Alexander Misurkin",Fyodor Yurchikhin,Alexander Misurkin,,394
373,"Fyodor Yurchikhin, Alexander Misurkin",Fyodor Yurchikhin,Alexander Misurkin,,449


Now it's just a matter of stacking the three astronaut columns into a single column with `UNION ALL`, then summing `duration` for each astronaut.

In [45]:
WITH astronauts_split AS (
    SELECT crew,
           SPLIT_PART(crew, ',', 1) AS first_astronaut,
           SPLIT_PART(crew, ',', 2) AS second_astronaut,
           SPLIT_PART(crew, ',', 3) AS third_astronaut,
           duration
    FROM evas
),

astronaut_durations AS (
    SELECT first_astronaut AS astronaut,
           duration
    FROM astronauts_split
    WHERE first_astronaut != ''
    UNION ALL
    SELECT second_astronaut AS astronaut,
           duration
    FROM astronauts_split
    WHERE second_astronaut != ''
    UNION ALL
    SELECT third_astronaut AS astronaut,
           duration
    FROM astronauts_split
    WHERE third_astronaut != ''
)

SELECT astronaut, 
       SUM(duration) AS total_duration
FROM astronaut_durations
GROUP BY astronaut
ORDER BY total_duration DESC
LIMIT 10;

Unnamed: 0,astronaut,total_duration
0,Jerry Ross,3501
1,Anatoly Solovyev,3086
2,Scott Parazynski,2825
3,Nikola Budarin,2672
4,John Grunsfeld,2527
5,Mike Lopez-Alegria,2501
6,Mike Fincke,2472
7,Dan Tani,2351
8,Victor Afanasyev,2314
9,Rick Mastracchio,2311
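
Grouping by name only works if each astronaut's name is written identically in every row, including whitespace. As a hedged alternative sketch (not the approach used above), `STRING_TO_ARRAY()` with `UNNEST()` turns each crew list into one row per name regardless of crew size, and `TRIM()` normalizes the spacing. The first sample row uses a crew and duration from the first table; the second row is hypothetical and only there to show the same two names in the opposite order.

```sql
WITH sample(crew, duration) AS (
    VALUES ('Jim Voss, Susan Helms', 536),
           ('Susan Helms, Jim Voss', 100)   -- hypothetical row
)
SELECT TRIM(astronaut) AS astronaut,
       SUM(duration)   AS total_duration
FROM sample
CROSS JOIN LATERAL UNNEST(STRING_TO_ARRAY(crew, ',')) AS astronaut
GROUP BY TRIM(astronaut)
ORDER BY total_duration DESC;
```

Without the `TRIM()`, `' Susan Helms'` and `'Susan Helms'` would be counted as two different astronauts. Crews written with another separator, such as `Thuot/Hieb/Akers` in the first table, would still need separate handling.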


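Disclaimer: these totals are limited by the quality of the data, and the figure for Anatoly Solovyev in particular should be treated with caution. Each EVA's full duration is also credited to every crew member, although probably not everyone spent the same amount of time outside during a given EVA.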

## What is the cumulative amount of time spent in EVAs over time?
Finally, let's take a look at the cumulative time spent in EVAs by year and space program. To do so, we will need to use a [window function](https://www.postgresql.org/docs/current/tutorial-window.html) in combination with a subquery.

In [51]:
SELECT
    TO_DATE(year::TEXT, 'YYYY') AS year,
    program,
    duration,
    -- Running total per program, accumulated in year order
    SUM(duration) OVER (PARTITION BY program ORDER BY year) AS cumulative_duration
FROM (
    SELECT year, program, SUM(duration) AS duration
    FROM evas
    GROUP BY year, program
) AS subquery
ORDER BY year, program;

Unnamed: 0,year,program,duration,cumulative_duration
0,1965-01-01 00:00:00+00:00,Gemini,36,36
1,1965-01-01 00:00:00+00:00,Voskhod,12,12
2,1966-01-01 00:00:00+00:00,Gemini,720,756
3,1969-01-01 00:00:00+00:00,Apollo,707,707
4,1969-01-01 00:00:00+00:00,Soyuz,37,37
...,...,...,...,...
63,2010-01-01 00:00:00+00:00,Space Shuttle,3591,61140
64,2011-01-01 00:00:00+00:00,ISS,1388,17901
65,2011-01-01 00:00:00+00:00,Space Shuttle,2492,63632
66,2012-01-01 00:00:00+00:00,ISS,2009,19910
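
To see how the window function builds the running total, here is a minimal sketch on three of the yearly totals from the output above, supplied through a hypothetical inline table named `yearly`. `PARTITION BY program` restarts the sum for each program, and `ORDER BY year` inside `OVER` makes the sum accumulate year by year.

```sql
SELECT year,
       program,
       duration,
       SUM(duration) OVER (PARTITION BY program ORDER BY year) AS cumulative_duration
FROM (
    VALUES (1965, 'Gemini',  36),
           (1966, 'Gemini',  720),
           (1965, 'Voskhod', 12)
) AS yearly(year, program, duration)
ORDER BY program, year;
```

The Gemini rows accumulate to 36 and then 756, while Voskhod starts again from its own first value of 12, matching the corresponding rows in the full result.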

