# Pandas Library

<h2>Pandas Explained</h2>

Pandas is a Python library that is used for data manipulation and analysis. It provides data structures and functions that simplify working with structured data, such as tables or spreadsheets.

In [1]:
import pandas as pd

<h2>Pandas Series</h2>

A Series is a one-dimensional labeled array capable of holding any data type (e.g., integers, floats, strings). It is similar to a single column in a spreadsheet, or to a Python list whose elements each carry a label.

In [11]:
import numpy as np 

data = np.array([10,20,30,40,50])
series = pd.Series(data)
print("Series from NumPy array")
print(series)
print()

Series from NumPy array
0    10
1    20
2    30
3    40
4    50
dtype: int32



A Series also exposes useful attributes and methods:

In [8]:
# attributes
print("Index: ", series.index)
print("Values: ", series.values)

# methods
print("Sum of values: ", series.sum())
print("Maximum value: ", series.max())
print("Minimum value: ", series.min())
print()

Index:  RangeIndex(start=0, stop=5, step=1)
Values:  [10 20 30 40 50]
Sum of values:  150
Maximum value:  50
Minimum value:  10
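
The "labeled" part of the definition is easier to see with a custom index. The following is a minimal sketch, assuming hypothetical labels `"a"` through `"e"` (they are not part of the original example), that shows label-based and position-based access side by side.

```python
import pandas as pd

# Hypothetical labels "a".."e" replace the default 0..4 index
labeled = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = labeled["c"]          # look up a value by its label
by_position = labeled.iloc[2]    # look up a value by its integer position
label_slice = labeled["b":"d"]   # label slices include both endpoints

print(by_label, by_position)
print(label_slice)
```

Note that slicing by label keeps the end label, unlike ordinary Python slicing by position.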



<h2>Pandas DataFrames</h2>

A Pandas DataFrame is a two-dimensional labeled data structure with rows and columns, similar to a spreadsheet.

In [10]:
# Creating a DataFrame from a dictionary
data = {
        "Name": ["John", "Emma", "Peter"],
        "Age": [30, 25, 35],
        "City": ["New York", "London", "Paris"]
       }

df = pd.DataFrame(data)

print("DataFrame created from dictionary: ")
print(df)
print()

DataFrame created from dictionary: 
    Name  Age      City
0   John   30  New York
1   Emma   25    London
2  Peter   35     Paris



You can access rows, columns, or individual elements using indexing, slicing, or label-based methods.

In [16]:
# Accessing columns
print("Accessing columns: ")  
print("---------------------")
print(df["Name"])  # accessing a single column
print()
print(df[["Name", "Age"]])  # accessing multiple columns
print()

# Accessing rows
print("Accessing rows:")
print("---------------------")
print(df.loc[1])  # accessing a single row by index label
print()
print(df.iloc[0])  # accessing a single row by integer position
print()

# Accessing elements
print("Accessing elements: ")
print("---------------------")
print(df.at[0, "Age"])  # accessing a single element by label
print()
print(df.iat[1, 2])  # accessing a single element by integer position
print()

Accessing columns: 
---------------------
0     John
1     Emma
2    Peter
Name: Name, dtype: object

    Name  Age
0   John   30
1   Emma   25
2  Peter   35

Accessing rows:
---------------------
Name      Emma
Age         25
City    London
Name: 1, dtype: object

Name        John
Age           30
City    New York
Name: 0, dtype: object

Accessing elements: 
---------------------
30

London
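
The cell above covers single rows, columns, and elements; slicing works along the same lines. The sketch below rebuilds the same three-row table under the assumed name `df_people` (a name chosen here for illustration) and selects ranges with `iloc` and `loc`.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["John", "Emma", "Peter"],
    "Age": [30, 25, 35],
    "City": ["New York", "London", "Paris"],
})

# Positional slice: rows at positions 0 and 1 (the end position is excluded)
first_two = df_people.iloc[0:2]

# Label slice: rows labeled 0 through 1 (the end label is included), two columns
label_rows = df_people.loc[0:1, ["Name", "City"]]

# Every row of a single column
ages = df_people.loc[:, "Age"]

print(first_two)
print(label_rows)
```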



You can perform various manipulations on DataFrames, such as adding or removing columns, filtering rows, and applying functions.

In [21]:
# adding a new column
df["Gender"] = ["Male", "Female", "Male"]
print("DataFrame after adding a new column: ")
print(df)
print()

# filtering rows based on a condition
filtered_df = df[df["Age"] > 25]
print("Filtered DataFrame:")
print(filtered_df)
print()

# applying a function to a column
df["Age"] = df["Age"].apply(lambda x: x + 1)
print("DataFrame after applying function to 'Age' column: ")
print(df)
print()

DataFrame after adding a new column: 
    Name  Age      City  Gender
0   John   30  New York    Male
1   Emma   25    London  Female
2  Peter   35     Paris    Male

Filtered DataFrame:
    Name  Age      City  Gender
0   John   30  New York    Male
2  Peter   35     Paris    Male

DataFrame after applying function to 'Age' column: 
    Name  Age      City  Gender
0   John   31  New York    Male
1   Emma   26    London  Female
2  Peter   36     Paris    Male
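
Removing a column is mentioned above but not shown. Here is a minimal sketch, assuming a fresh copy of the table named `people` and a hypothetical `"Age next year"` column, that drops a column and also shows a vectorized alternative to `apply` for simple arithmetic.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Emma", "Peter"],
    "Age": [30, 25, 35],
    "Gender": ["Male", "Female", "Male"],
})

# drop returns a new DataFrame; `people` itself is left unchanged
without_gender = people.drop(columns=["Gender"])

# Arithmetic on a column is applied element-wise, so no lambda is needed here
people["Age next year"] = people["Age"] + 1

print(without_gender)
print(people)
```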



# 1. Planetary Analysis with Pandas

Imagine that you are part of a team analyzing planetary data. Your task is to process and analyze the data using Pandas to extract insights.

<h2>Objectives:</h2>

- Create a Pandas Series representing distances of planets from the Sun (in million kilometers)
- Create a Pandas DataFrame representing characteristics of moons of the outer planets
- Analyze the data to find key information about planetary distances and moon characteristics

In [6]:
planet_distances = {
    "Mercury": 57.9,
    "Venus": 108.2,
    "Earth": 149.6,
    "Mars": 227.9,
    "Jupiter": 778.6,
    "Saturn": 1433.5,
    "Uranus": 2872.5,
    "Neptune": 4495.1,
    "Pluto": 5906.4,
}

moon_data = {
    "Planet": ["Jupiter", "Jupiter", "Saturn", "Saturn", "Uranus", "Neptune"],
    "Moon": ["Io", "Ganymede", "Titan", "Rhea", "Titania", "Triton"],
    "Diameter (km)": [3642, 5262, 5150, 1528, 1578, 2707],
    "Orbital Period (days)": [1.77, 7.15, 15.95, 4.52, 8.71, 5.88]
}

In [7]:
import pandas as pd

distances_series = pd.Series(planet_distances)
moon_characteristics_df = pd.DataFrame(moon_data)
print(moon_characteristics_df)

    Planet      Moon  Diameter (km)  Orbital Period (days)
0  Jupiter        Io           3642                   1.77
1  Jupiter  Ganymede           5262                   7.15
2   Saturn     Titan           5150                  15.95
3   Saturn      Rhea           1528                   4.52
4   Uranus   Titania           1578                   8.71
5  Neptune    Triton           2707                   5.88


In [62]:
# analyze the data
print("Average distance of planets from the Sun: ")
print(distances_series.mean())
print()

outer_planets = ["Jupiter", "Saturn", "Uranus", "Neptune"]
print("Outer planets: ", outer_planets)
print()

print("Number of moons for each outer planet: ")

for planet in outer_planets:
    num_moons = moon_characteristics_df[moon_characteristics_df["Planet"] == planet].shape[0]
    print(f"{planet}: {num_moons}")
print()

print("Largest moon of each outer planet: ")

for planet in outer_planets:
    largest_moon = moon_characteristics_df[moon_characteristics_df["Planet"] == planet].sort_values(by="Diameter (km)", ascending=False).iloc[0]
    # single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{planet}: {largest_moon['Moon']} ({largest_moon['Diameter (km)']} km)")

Average distance of planets from the Sun: 
1781.0777777777778

Outer planets:  ['Jupiter', 'Saturn', 'Uranus', 'Neptune']

Number of moons for each outer planet: 
Jupiter: 2
Saturn: 2
Uranus: 1
Neptune: 1

Largest moon of each outer planet: 
Jupiter: Ganymede (5262 km)
Saturn: Titan (5150 km)
Uranus: Titania (1578 km)
Neptune: Triton (2707 km)
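
The loops above can also be expressed without explicit iteration. The following sketch, which rebuilds part of the moon table under the assumed name `moons`, is one possible alternative using `value_counts` and `groupby` with `idxmax`; it is not the only way to do it.

```python
import pandas as pd

moons = pd.DataFrame({
    "Planet": ["Jupiter", "Jupiter", "Saturn", "Saturn", "Uranus", "Neptune"],
    "Moon": ["Io", "Ganymede", "Titan", "Rhea", "Titania", "Triton"],
    "Diameter (km)": [3642, 5262, 5150, 1528, 1578, 2707],
})

# Rows per planet (this counts moons listed in the table, not all known moons)
moon_counts = moons["Planet"].value_counts()

# Row label of the largest diameter within each planet, then look those rows up
largest_idx = moons.groupby("Planet")["Diameter (km)"].idxmax()
largest_moons = moons.loc[largest_idx, ["Planet", "Moon", "Diameter (km)"]]

print(moon_counts)
print(largest_moons)
```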


# Working with Real Life Data

Pandas can read a wide range of file formats from external sources. The code below imports data from a CSV file and displays information about it.

The cell below loads a CSV file of SpaceX missions directly from a URL.

In [8]:
spacex_missions_csv = "https://raw.githubusercontent.com/BriantOliveira/SpaceX-Dataset/master/dataset/SpaceX-Missions.csv"
df = pd.read_csv(spacex_missions_csv)
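
`read_csv` also accepts a local path or a file-like object, which is handy for trying things out offline. The sketch below assumes a tiny in-memory CSV named `csv_text` with three columns and three rows taken from this dataset; empty fields are read as `NaN`.

```python
import io
import pandas as pd

# A small in-memory stand-in for the remote file; the second row has no payload mass
csv_text = """Flight Number,Launch Site,Payload Mass (kg)
F1-1,Marshall Islands,19.5
F1-2,Marshall Islands,
F1-5,Marshall Islands,180.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)        # column types inferred by read_csv
print(sample.isna().sum())  # missing values per column
```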

You can learn more about a dataset using the following methods:

In [77]:
# display the first few rows of the DataFrame
print(df.head())
print()

# check basic information about the DataFrame
# (info() prints its report directly and returns None, so it is not wrapped in print)
df.info()
print()

# summary statistics of numerical columns
print(df.describe())
print()

  Flight Number    Launch Date Launch Time       Launch Site Vehicle Type  \
0          F1-1  24 March 2006       22:30  Marshall Islands     Falcon 1   
1          F1-2  21 March 2007       01:10  Marshall Islands     Falcon 1   
2          F1-3  3 August 2008       03:34  Marshall Islands     Falcon 1   
3          F1-3  3 August 2008       03:34  Marshall Islands     Falcon 1   
4          F1-3  3 August 2008       03:34  Marshall Islands     Falcon 1   

         Payload Name             Payload Type  Payload Mass (kg)  \
0         FalconSAT-2       Research Satellite               19.5   
1             DemoSat                      NaN                NaN   
2         Trailblazer  Communication Satellite                NaN   
3  PRESat, NanoSail-D      Research Satellites                8.0   
4           Explorers            Human Remains                NaN   

  Payload Orbit Customer Name Customer Type Customer Country Mission Outcome  \
0           NaN         DARPA    Governmen

# Core Methods with Pandas

<h2>Mathematical & Statistical Methods</h2>

You can apply basic statistical and mathematical functions such as <code>mean</code>, <code>median</code>, <code>sum</code>, <code>std</code> (standard deviation), <code>var</code> (variance), and more!

In [9]:
mean_payload_mass = df["Payload Mass (kg)"].mean()
print("Payload mass mean: ", mean_payload_mass)

median_payload_mass = df["Payload Mass (kg)"].median()
print("Payload mass median: ", median_payload_mass)

sum_payload_mass = df["Payload Mass (kg)"].sum()
print("Payload mass sum: ", sum_payload_mass)

std_payload_mass = df["Payload Mass (kg)"].std()
print("Payload mass std: ", std_payload_mass)

var_payload_mass = df["Payload Mass (kg)"].var()
print("Payload mass var: ", var_payload_mass)

Payload mass mean:  2739.7727272727275
Payload mass median:  2490.0
Payload mass sum:  90412.5
Payload mass std:  2131.502972856349
Payload mass var:  4543304.923295454
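
Several payload masses in this dataset are missing, and these methods skip `NaN` values by default. As a small illustration, the sketch below uses the first five payload masses from the `head()` output shown earlier, under the assumed name `masses`.

```python
import numpy as np
import pandas as pd

# First five payload masses from the dataset; NaN marks a missing value
masses = pd.Series([19.5, np.nan, np.nan, 8.0, np.nan])

mean_skipping_nan = masses.mean()           # NaN ignored: (19.5 + 8.0) / 2
mean_with_nan = masses.mean(skipna=False)   # NaN propagates to the result
non_missing = masses.count()                # number of non-NaN entries

print(mean_skipping_nan, mean_with_nan, non_missing)
```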


<h2>Exploratory Methods</h2>

Exploratory methods in Pandas allow you to learn more about the dataset at hand. Some of these methods include <code>describe</code> and <code>unique</code>.

In [12]:
# print all unique launch sites in the dataset
print("Unique launch sites:")
print(pd.unique(df["Launch Site"]))
print()

# describe the statistical values of all the numerical columns within the dataset
df.describe(include=[np.number])

Unique launch sites:
['Marshall Islands' 'Cape Canaveral AFS LC-40' 'Vandenberg AFB SLC-4E'
 'Kennedy Space Center LC-39A']



       Payload Mass (kg)
count          33.000000
mean         2739.772727
std          2131.502973
min             8.000000
25%           570.000000
50%          2490.000000
75%          4159.000000
max          9600.000000
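
Two related methods are often useful alongside `unique`: `value_counts` and `nunique`. The sketch below assumes a short stand-in Series named `sites` holding a few launch-site values rather than the full column.

```python
import pandas as pd

# A short stand-in for the "Launch Site" column
sites = pd.Series([
    "Marshall Islands", "Marshall Islands", "Marshall Islands", "Marshall Islands",
    "Cape Canaveral AFS LC-40", "Cape Canaveral AFS LC-40", "Cape Canaveral AFS LC-40",
    "Vandenberg AFB SLC-4E",
], name="Launch Site")

site_counts = sites.value_counts()   # rows per distinct value, most frequent first
n_sites = sites.nunique()            # number of distinct values

print(site_counts)
print(n_sites)
```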


<h2>Data Selection and Filtering</h2>

In [13]:
# select specific columns
selected_columns = df[["Payload Name", "Payload Orbit"]]
print(selected_columns)
print()

# filter rows based on condition (if payload mass exceeds 3000)
filtered_payloads = df[df["Payload Mass (kg)"] > 3000]
print(filtered_payloads)

                            Payload Name                 Payload Orbit
0                            FalconSAT-2                           NaN
1                                DemoSat                           NaN
2                            Trailblazer                           NaN
3                     PRESat, NanoSail-D                           NaN
4                              Explorers                           NaN
5                       RatSat (DemoSat)               Low Earth Orbit
6                               RazakSAT               Low Earth Orbit
7   Dragon Spacecraft Qualification Unit               Low Earth Orbit
8                 SpaceX CRS (Dragon C1)               Low Earth Orbit
9                SpaceX CRS (Dragon C2+)               Low Earth Orbit
10                          SpaceX CRS-1               Low Earth Orbit
11                           Orbcomm-OG2               Low Earth Orbit
12                          SpaceX CRS-2               Low Earth Orbit
13    
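
Conditions can be combined when one filter is not enough. The following sketch assumes a five-row stand-in table named `launches` built from rows of this dataset, and shows `&` for combining conditions and `isin` for matching against a list.

```python
import pandas as pd

launches = pd.DataFrame({
    "Payload Name": ["FalconSAT-2", "RazakSAT", "CASSIOPE", "SES-8", "Thaicom 6"],
    "Launch Site": ["Marshall Islands", "Marshall Islands", "Vandenberg AFB SLC-4E",
                    "Cape Canaveral AFS LC-40", "Cape Canaveral AFS LC-40"],
    "Payload Mass (kg)": [19.5, 180.0, 500.0, 3170.0, 3325.0],
})

# Each condition needs its own parentheses; & means "and", | means "or"
heavy_from_cape = launches[
    (launches["Payload Mass (kg)"] > 3000)
    & (launches["Launch Site"] == "Cape Canaveral AFS LC-40")
]

# isin keeps rows whose value appears in the given list
selected_sites = launches[launches["Launch Site"].isin(
    ["Marshall Islands", "Vandenberg AFB SLC-4E"]
)]

print(heavy_from_cape)
print(selected_sites)
```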

<h2>Grouping Data</h2>

Another useful feature in Pandas is <code>groupby</code>, which splits the rows into groups based on the values in a column so that you can run an analysis on each group separately.

In [14]:
# grouping payload by launch site
launch_groups = df.groupby("Launch Site")

# obtaining the mean value for each group
launch_groups["Payload Mass (kg)"].mean()

Launch Site
Cape Canaveral AFS LC-40       3075.880
Kennedy Space Center LC-39A    2490.000
Marshall Islands                 93.125
Vandenberg AFB SLC-4E          3551.000
Name: Payload Mass (kg), dtype: float64
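
A grouped column can also be summarized with several statistics at once through `agg`. This is a minimal sketch on an assumed five-row stand-in table named `launches`, so the numbers differ from the full-dataset output above.

```python
import pandas as pd

launches = pd.DataFrame({
    "Launch Site": ["Marshall Islands", "Marshall Islands", "Vandenberg AFB SLC-4E",
                    "Cape Canaveral AFS LC-40", "Cape Canaveral AFS LC-40"],
    "Payload Mass (kg)": [19.5, 180.0, 500.0, 3170.0, 3325.0],
})

# One row per launch site, one column per statistic
summary = launches.groupby("Launch Site")["Payload Mass (kg)"].agg(["count", "mean", "max"])

print(summary)
```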

# 2. Deeper Dive into SpaceX Launch Dataset

<h2>Objectives</h2>

- Reinitialize the dataframe and name it <code>launches_dataset</code>
- View the first values of the dataset using the <code>head()</code> method
- Print out the "Customer Country" column
- Print out all the unique customer countries
- Make a variable named <code>launches_by_country</code> and make it grouped by the "Customer Country" column
- Print out all the launches with a payload mass ("Payload Mass (kg)") of less than 4000 kg
- Print the median payload mass for <code>United States</code>

In [42]:
import pandas as pd

spacex_missions_csv = "https://raw.githubusercontent.com/BriantOliveira/SpaceX-Dataset/master/dataset/SpaceX-Missions.csv"
launches_dataset = pd.read_csv(spacex_missions_csv)

In [41]:
launches_dataset.head()

Unnamed: 0,Flight Number,Launch Date,Launch Time,Launch Site,Vehicle Type,Payload Name,Payload Type,Payload Mass (kg),Payload Orbit,Customer Name,Customer Type,Customer Country,Mission Outcome,Failure Reason,Landing Type,Landing Outcome
0,F1-1,24 March 2006,22:30,Marshall Islands,Falcon 1,FalconSAT-2,Research Satellite,19.5,,DARPA,Government,United States,Failure,Engine Fire During Launch,,
1,F1-2,21 March 2007,01:10,Marshall Islands,Falcon 1,DemoSat,,,,DARPA,Government,United States,Failure,Engine Shutdown During Launch,,
2,F1-3,3 August 2008,03:34,Marshall Islands,Falcon 1,Trailblazer,Communication Satellite,,,ORS,Government,United States,Failure,Collision During Launch,,
3,F1-3,3 August 2008,03:34,Marshall Islands,Falcon 1,"PRESat, NanoSail-D",Research Satellites,8.0,,NASA,Government,United States,Failure,Collision During Launch,,
4,F1-3,3 August 2008,03:34,Marshall Islands,Falcon 1,Explorers,Human Remains,,,Celestis,Business,United States,Failure,Collision During Launch,,


In [53]:
print("Customer Country: ")
launches_dataset["Customer Country"]

Customer Country: 


0       United States
1       United States
2       United States
3       United States
4       United States
5                 NaN
6            Malaysia
7                 NaN
8       United States
9       United States
10      United States
11      United States
12      United States
13             Canada
14         Luxembourg
15           Thailand
16      United States
17      United States
18              China
19              China
20      United States
21      United States
22      United States
23            Bermuda
24    France (Mexico)
25      United States
26       Turkmenistan
27      United States
28      United States
29      United States
30         Luxembourg
31      United States
32              Japan
33           Thailand
34            Bermuda
35    France (Mexico)
36      United States
37              Japan
38             Israel
39      United States
40      United States
Name: Customer Country, dtype: object

In [54]:
print("Unique customer countries: ")
pd.unique(launches_dataset["Customer Country"])

Unique customer countries: 


array(['United States', nan, 'Malaysia', 'Canada', 'Luxembourg',
       'Thailand', 'China', 'Bermuda', 'France (Mexico)', 'Turkmenistan',
       'Japan', 'Israel'], dtype=object)

In [55]:
launches_by_country = launches_dataset.groupby("Customer Country")
launches_by_country.head()

Unnamed: 0,Flight Number,Launch Date,Launch Time,Launch Site,Vehicle Type,Payload Name,Payload Type,Payload Mass (kg),Payload Orbit,Customer Name,Customer Type,Customer Country,Mission Outcome,Failure Reason,Landing Type,Landing Outcome
0,F1-1,24 March 2006,22:30,Marshall Islands,Falcon 1,FalconSAT-2,Research Satellite,19.5,,DARPA,Government,United States,Failure,Engine Fire During Launch,,
1,F1-2,21 March 2007,01:10,Marshall Islands,Falcon 1,DemoSat,,,,DARPA,Government,United States,Failure,Engine Shutdown During Launch,,
2,F1-3,3 August 2008,03:34,Marshall Islands,Falcon 1,Trailblazer,Communication Satellite,,,ORS,Government,United States,Failure,Collision During Launch,,
3,F1-3,3 August 2008,03:34,Marshall Islands,Falcon 1,"PRESat, NanoSail-D",Research Satellites,8.0,,NASA,Government,United States,Failure,Collision During Launch,,
4,F1-3,3 August 2008,03:34,Marshall Islands,Falcon 1,Explorers,Human Remains,,,Celestis,Business,United States,Failure,Collision During Launch,,
6,F1-5,14 July 2009,03:35,Marshall Islands,Falcon 1,RazakSAT,Weather Satellite,180.0,Low Earth Orbit,ATSB,Government,Malaysia,Success,,,
13,F9-6,29 September 2013,16:00,Vandenberg AFB SLC-4E,Falcon 9 (v1.1),CASSIOPE,Communication/Research Satellite,500.0,Polar Orbit,MDA Corp,Business,Canada,Success,,Ocean,Failure
14,F9-7,3 December 2013,22:41,Cape Canaveral AFS LC-40,Falcon 9 (v1.1),SES-8,Communication Satellite,3170.0,Geostationary Transfer Orbit,SES,Business,Luxembourg,Success,,,
15,F9-8,6 December 2014,22:06,Cape Canaveral AFS LC-40,Falcon 9 (v1.1),Thaicom 6,Communication Satellite,3325.0,Geostationary Transfer Orbit,Thaicom,Business,Thailand,Success,,,
18,F9-11,5 August 2014,08:00,Cape Canaveral AFS LC-40,Falcon 9 (v1.1),AsiaSat 8,Communication Satellite,4535.0,Geostationary Transfer Orbit,AsiaSat,Business,China,Success,,,
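
Rows 5 and 7, whose customer country is `NaN`, do not appear in the grouped output above because `groupby` drops missing keys by default. The sketch below assumes a four-row stand-in table named `sample` and compares the default with `dropna=False` (available in pandas 1.1 and later).

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Customer Country": ["United States", np.nan, "Malaysia", np.nan],
    "Payload Name": ["FalconSAT-2", "RatSat (DemoSat)", "RazakSAT",
                     "Dragon Spacecraft Qualification Unit"],
})

# Default: rows with a missing key are left out of every group
default_groups = sample.groupby("Customer Country").size()

# dropna=False keeps them in a separate NaN group
with_missing = sample.groupby("Customer Country", dropna=False).size()

print(default_groups)
print(with_missing)
```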


In [45]:
under_4000 = launches_dataset[launches_dataset["Payload Mass (kg)"] < 4000]
print("Launches where Payload Mass (kg) is less than 4000 kg: ")
print(under_4000)

Launches where Payload Mass (kg) is less than 4000 kg: 
   Flight Number        Launch Date Launch Time                  Launch Site  \
0           F1-1      24 March 2006       22:30             Marshall Islands   
3           F1-3      3 August 2008       03:34             Marshall Islands   
5           F1-4  28 September 2008       23:15             Marshall Islands   
6           F1-5       14 July 2009       03:35             Marshall Islands   
10          F9-4     8 October 2012       00:35     Cape Canaveral AFS LC-40   
11          F9-4     8 October 2012       00:35     Cape Canaveral AFS LC-40   
12          F9-5       1 March 2013       15:10     Cape Canaveral AFS LC-40   
13          F9-6  29 September 2013       16:00        Vandenberg AFB SLC-4E   
14          F9-7    3 December 2013       22:41     Cape Canaveral AFS LC-40   
15          F9-8    6 December 2014       22:06     Cape Canaveral AFS LC-40   
16          F9-9      18 April 2014       19:25     Cape Canaver

In [None]:
launches_data
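
The final objective, the median payload mass for <code>United States</code>, could be approached along these lines. The sketch assumes a five-row stand-in table named `sample` taken from rows of the dataset, so the result is not the full-dataset median.

```python
import pandas as pd

sample = pd.DataFrame({
    "Customer Country": ["United States", "United States", "Malaysia", "Canada", "Luxembourg"],
    "Payload Mass (kg)": [19.5, 8.0, 180.0, 500.0, 3170.0],
})

# Median per country, then pick out one country by its label
by_country = sample.groupby("Customer Country")
us_median = by_country["Payload Mass (kg)"].median()["United States"]

# Same result with a boolean mask instead of groupby
us_mask = sample["Customer Country"] == "United States"
us_median_mask = sample.loc[us_mask, "Payload Mass (kg)"].median()

print(us_median, us_median_mask)
```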