<a id="top"></a>
<h1>This notebook shows ways to add detail to your plots and make them look elegant, by analyzing the Space Mission Data</h1>

<h3>This Exploratory Data Analysis Answers the Following Questions</h3>
<h2><ol><li><a href="#1">How's the yearly trend in number of missions?</a></li>
    <li><a href="#2">How's the yearly trend in mission success rates?</a></li>
    <li><a href="#12">Which months had the highest and lowest launches?</a></li>
    <li><a href="#14">Which days had the highest and lowest launches?</a></li>
    <li><a href="#3">What's the count of missions per company?</a></li>
    <li><a href="#4">How's the yearly trend in number of missions per company?</a></li>
    <li><a href="#5">What's the success rate per company?</a></li>
    <li><a href="#6">What's the count of missions per launch location?</a></li>
    <li><a href="#7">How's the yearly trend in number of missions per launch location?</a></li>
    <li><a href="#8">How's the yearly trend in number of launches of top 5 locations based on number of launches?</a></li>
    <li><a href="#9">What's the success rate per launch locations?</a></li>
    <li><a href="#10">Is there any relationship between missing price and mission status?</a></li>
    <li><a href="#11">Is there any relationship between price and mission status?</a></li>
    <li><a href="#13">How many missions are active?</a></li>
    <li><a href="#15">What is the trend in the price of mission over the years?</a></li>
    </ol></h2>
    
    
    
    
    

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from wordcloud import WordCloud

df = pd.read_csv("../input/all-space-missions-from-1957/Space_Corrected.csv",parse_dates=True)
df = df.iloc[:,2:].copy() #deleting the unnamed columns
df.head()

In [None]:
df.rename(columns={" Rocket":"Price"},inplace=True) #renaming the rocket column to Price

#Function to extract the launch location from the location address
def country_extract(s):
    s = s.split(",")
    return s[len(s)-1].strip()

df["Country"]=df["Location"].map(country_extract)
df["Status Rocket"] = df["Status Rocket"].str.replace("Status","")

#date extractor from Datum
def extract_date(s):
    s = s.split(" ")
    s="-".join(s[1:4])
    s=s.replace(",","")
    s = datetime.datetime.strptime(s,"%b-%d-%Y")
    return s

df["Datum"] = df["Datum"].map(extract_date)
df["Year"] = df["Datum"].dt.year
df["Month"] = df["Datum"].dt.month_name()
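
The same date extraction can also be done without a Python-level loop. The cell below is a minimal sketch, assuming the `Datum` strings look like the hypothetical samples shown (weekday, month abbreviation, day, year, and an optional time); it pulls out the date part with a regular expression and hands it to `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical sample values shaped like the Datum column (an assumption)
datum = pd.Series([
    "Fri Aug 07, 2020 05:12 UTC",
    "Fri Oct 04, 1957 19:28 UTC",
    "Wed Nov 29, 1967",  # a row without a time component
])

# Keep only the "Mon DD, YYYY" part, ignoring the weekday and the time
date_part = datum.str.extract(r"([A-Za-z]{3} \d{1,2}, \d{4})", expand=False)
parsed = pd.to_datetime(date_part, format="%b %d, %Y")

years = parsed.dt.year
months = parsed.dt.month_name()
print(pd.DataFrame({"Datum": parsed, "Year": years, "Month": months}))
```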

<a id="1"></a><h2>How's the yearly trend in number of missions?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Adding gridlines and markers to a line plot</h3>

In [None]:
ax=df["Year"].value_counts().sort_index().plot(figsize=(15,6),marker="x",color="black")
ax.set_axisbelow(True)
ax.set_xticks(df["Year"].unique())
ax.yaxis.grid(color='lightgray', linestyle='dashed')
ax.xaxis.grid(color='lightgray', linestyle='dashed')
plt.title("Number of missions across years")
plt.ylabel("Number of Missions")
plt.xticks(rotation=90)
plt.show()

1965-1978 saw more space missions than later years. The number of missions also rises again from 2016 onward.

<a id="2"></a><h2>How's the yearly trend in mission success rates?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Creating stacked column plot adding up to 100%</h3>

In [None]:
st = df[["Status Mission","Year"]].copy()
st["Status Mission"] = st["Status Mission"].replace(".* Failure","Failure",regex=True)
st = pd.crosstab(columns=st["Status Mission"],index=st["Year"],normalize="index")
ax = st.plot(kind="bar",stacked=True,figsize=(15,5),color=["black","gray"])
ax.set_axisbelow(True)
ax.yaxis.grid(color='gray', linestyle='dashed')
plt.title("Mission success/failure rate over years")
plt.ylabel("% of success/failure")
plt.legend(loc="upper left",ncol=2)
plt.show()

In [None]:
piv = df[["Year","Status Mission"]].copy()
piv["Status Mission"] = piv["Status Mission"].replace(".* Failure","Failure",regex=True)
#include_lowest=True keeps 1957 in the first bin; labels follow the right-inclusive bin edges
piv["Year range"] = pd.cut(piv["Year"],bins=[1957,1977,1997,2020],include_lowest=True,labels=["1957-1977","1978-1997","1998-2020"])
piv = pd.crosstab(columns=piv["Status Mission"],index=piv["Year range"])
piv["Success Rate"] = 100*(piv["Success"] / (piv["Success"]+piv["Failure"]))
piv["Failure Rate"] = 100*(piv["Failure"] / (piv["Success"]+piv["Failure"]))
piv.drop(columns=["Failure","Success"],inplace=True)

fig=plt.figure(figsize=(5,5))
ax=sns.heatmap(piv.T,annot=True,fmt="0.0f",cbar=False,cmap="gray",square=True,linewidths=0.1,linecolor="gray")
ax.xaxis.set_ticks_position('top')
#plt.xticks(rotation=0)
#plt.yticks(rotation=0)
plt.title("Success & failure rates over years\n\n")
plt.ylabel("")
plt.xlabel("")
plt.show()

Early years saw a high failure rate compared with later years, since space technology was still emerging.<br>
The first 20 years of space science (1957-1977) saw a higher failure rate (16%). Since then, the failure rate has stood at 7%.
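
When grouping years into ranges, it helps to remember that `pd.cut` builds intervals that are open on the left by default, so a value equal to the lowest bin edge falls outside every bin. The sketch below uses a handful of made-up years and labels chosen to match the bin edges (both are assumptions for illustration) to show the effect of `include_lowest=True`.

```python
import pandas as pd

# Made-up years sitting on and around the bin edges
years = pd.Series([1957, 1958, 1977, 1978, 1997, 1998, 2020])
bins = [1957, 1977, 1997, 2020]
labels = ["1957-1977", "1978-1997", "1998-2020"]

# Default: intervals are (1957, 1977], (1977, 1997], (1997, 2020]
default_cut = pd.cut(years, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"Year": years, "default": default_cut, "include_lowest": inclusive_cut}))
```

With the default settings, 1957 is left unassigned (NaN); with `include_lowest=True` it lands in the first range.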

<a id="12"></a><h2>Which months had the highest and lowest launches?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Word cloud to visualize frequencies</h3>

In [None]:
wc = WordCloud(font_step=30,min_font_size=8,background_color="black",color_func=lambda *args, **kwargs: "white")
wc.generate_from_frequencies(frequencies=dict(df["Month"].value_counts()))
fig = plt.figure(figsize=(10,7))
plt.axis("off")
print(df["Month"].value_counts())
plt.imshow(wc);

<a id="14"></a><h2>Which days had the highest and lowest launches?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>


In [None]:
day = df['Datum'].apply(lambda x: x.strftime("%A"))
wc = WordCloud(font_step=10,min_font_size=8,background_color="black",color_func=lambda *args, **kwargs: "white")
wc.generate_from_frequencies(frequencies=dict(day.value_counts()))
fig = plt.figure(figsize=(10,7))
plt.axis("off")
print(day.value_counts())
plt.imshow(wc);

Most of the launches were made mid-week.

<h3>Using bubble plot to visualize frequency</h3>

In [None]:
day = df['Datum'].apply(lambda x: x.strftime("%A")).value_counts().sort_values(ascending=True).reset_index()
day.columns = ['day','no of launches']
fig = plt.figure(figsize=(8,9))
sns.scatterplot(data=day,x=5,y=np.linspace(0,100,7),size='no of launches',sizes=(50,2000),legend=False,color='black')
plt.axis('off')
plt.title("Day-wise number of launches")
for i,feat,imp in zip(np.linspace(0,100,7),day["day"],day["no of launches"]):
    plt.text(x=5.05,y=i-1,s=feat)
    plt.text(x=4.89,y=i-1,s=np.round(imp,2))

<a id="3"></a><h2>What's the count of missions per company?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Adding data labels to bar plots</h3>

In [None]:
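cnt = df["Company Name"].value_counts().reset_index()[:20]
#naming the columns explicitly, since reset_index() labels them differently across pandas versions
cnt.columns = ["index","Company Name"]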

sns.catplot(y="index",x="Company Name",data=cnt,kind="bar",height=8,color="black")
plt.title("Companies by no of missions")
plt.ylabel("Company")
plt.xlabel("No of missions")

for i in range(cnt.shape[0]):
    plt.text(s=str(cnt.iloc[i,1]),y=i,x=cnt.iloc[i,1]+10)
plt.show()

RVSN USSR stands out as the single largest company by number of launches, among companies from several locations, predominantly the US.

<a id="4"></a><h2>How's the yearly trend in number of missions per company?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Swarm plot to visualize multivariate data distribution</h3>

In [None]:
cl = df.copy()
cl["Status Mission"] = cl["Status Mission"].replace(".* Failure","Failure",regex=True)
#sns.set_palette() returns None, so the colors are passed to palette directly
sns.catplot(data=cl.loc[cl["Company Name"].isin(cnt["index"]),:],x="Company Name",y="Year",kind="swarm",height=7,aspect=2,
            palette=["gray","black"],hue="Status Mission")
plt.title("Yearly trend in number of missions per company")
plt.xticks(rotation=90)
plt.show()

<ul>
<li>We can see RVSN USSR, NASA, General Dynamics, and US Air Force as the pioneers of space science. Except for NASA, these pioneers have no missions after the year 2000.</li>
<li>There are a few emerging companies, like Roscosmos and SpaceX, that began their journey in the previous decade.</li>
</ul>

<a id="5"></a><h2>What's the success rate per company?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Heat map to visualize data comparison</h3>

In [None]:
piv = df[["Company Name","Status Mission"]].copy()
piv["Status Mission"] = piv["Status Mission"].replace(".* Failure","Failure",regex=True)
piv = pd.crosstab(columns=piv["Status Mission"],index=piv["Company Name"])

piv["Success Rate"] = 100*(piv["Success"] / (piv["Success"]+piv["Failure"]))
piv["Failure Rate"] = 100*(piv["Failure"] / (piv["Success"]+piv["Failure"]))
piv.sort_values(by="Success Rate",ascending=False,inplace=True)

piv.drop(columns=["Failure","Success"],inplace=True)

fig,(ax1,ax2)=plt.subplots(2,1,figsize=(15,3))
ax1.set_title("Company-wise success and failure rates")
plt.ylabel("Company")
ax1.xaxis.set_ticks_position('top')
plt.xticks(rotation=90)
sns.heatmap(piv[:28].T,annot=True,fmt="0.0f",cbar=False,cmap="gray",square=True,linewidths=0.1,linecolor="lightgray",ax=ax1)
sns.heatmap(piv[28:].T,annot=True,fmt="0.0f",cbar=False,cmap="gray",square=True,linewidths=0.1,linecolor="lightgray",ax=ax2) #starts at row 28 so no company appears in both heatmaps
plt.show()

<ul>
<li>Success rate: (Successful Missions / No of launches)*100</li>
<li>Failure rate: (Failed Missions / No of launches)*100 or simply 100-Success rate</li>
<li>We can see that a few companies have a 100% success rate, while a few have a 100% failure rate. SpaceX, an emerging company, has a success rate of 94%.</li>
</ul>
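
The two rate formulas above can also be computed in one step. This is a minimal sketch on a made-up table (the company names `A` and `B` and their outcomes are hypothetical); it relies on `pd.crosstab` with `normalize="index"`, which divides each row by its total.

```python
import pandas as pd

# Made-up missions for two hypothetical companies
toy = pd.DataFrame({
    "Company Name": ["A", "A", "A", "A", "B", "B"],
    "Status Mission": ["Success", "Success", "Success", "Partial Failure", "Failure", "Success"],
})

# Collapse every "... Failure" variant into a single "Failure" class
toy["Status Mission"] = toy["Status Mission"].replace(".* Failure", "Failure", regex=True)

# Row-normalized crosstab: each row sums to 1, so multiplying by 100 gives rates
rates = 100 * pd.crosstab(index=toy["Company Name"], columns=toy["Status Mission"], normalize="index")
print(rates)
```

Because each row sums to 100, the failure rate is always 100 minus the success rate.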

<a id="6"></a><h2>What's the count of missions per launch location?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>

In [None]:
cnt_co = df.groupby(["Country"]).count()[["Location"]].copy().sort_values(by="Location",ascending=False).reset_index()
sns.catplot(y="Country",x="Location",data=cnt_co,kind="bar",color="black",height=9,aspect=1)
plt.ylabel("Launch location")
plt.xlabel("No of missions")
plt.title("Launch locations & no of missions")
for i in range(cnt_co.shape[0]):
    plt.text(s=str(cnt_co.iloc[i,1]),y=i,x=cnt_co.iloc[i,1]+10)
plt.show()

Launch locations in Russia, Kazakhstan, and the US saw the highest number of missions.

<a id="7"></a><h2>How's the yearly trend in number of missions per launch location?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>

In [None]:
cl = df.copy()
cl["Status Mission"] = cl["Status Mission"].replace(".* Failure","Failure",regex=True)
sns.catplot(data=cl,x="Country",y="Year",kind="swarm",height=7,aspect=2,palette=["gray","black"],hue="Status Mission") #colors passed directly; sns.set_palette() returns None
plt.title("Yearly trend in number of missions per launch location")
plt.xlabel("Launch location")
plt.xticks(rotation=90)
plt.show()

<ul>
<li>Earlier we saw RVSN USSR and three other companies from the US as the pioneers of space science.</li>
<li>Now coming to launch locations, we see that Kazakhstan, Russia, and the US were the first locations of space missions in history. <b><u>When the pioneers (companies) are from Russia and the US, why is Kazakhstan among the pioneers (launch locations) along with the US and Russia?</u></b></li>
</ul>
<br><h3>Let's see the companies that launched missions from Kazakhstan</h3>
    

In [None]:
list(df.loc[df["Country"]=="Kazakhstan","Company Name"].unique())

We can see that RVSN USSR also used Kazakhstan for its mission launches. Let's see who launched missions from Kazakhstan in the 1950s.

In [None]:
df.loc[(df["Country"]=="Kazakhstan") & (df["Year"]<1960),["Country","Year","Company Name"]].groupby(["Year","Company Name"]).count()["Country"]

It was RVSN USSR (one of the pioneers) that launched early missions from Kazakhstan, which made Kazakhstan one of the pioneer launch locations along with Russia and the US.

<a id="8"></a><h2>How's the yearly trend in number of launches of top 5 locations based on number of launches?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Visualizing multiple line plots</h3>

In [None]:
cl1 = df[df["Country"].isin(cnt_co.loc[:4,"Country"])].copy()
cl1.rename(columns={"Country":"Launch_Location"},inplace=True)
cl1 = pd.crosstab(index=cl1["Year"],columns=cl1["Launch_Location"]).copy()
ax=cl1.plot(figsize=(15,6),cmap="Paired",marker="x")
ax.set_axisbelow(True)
ax.yaxis.grid(color='lightgray', linestyle='dashed')
ax.xaxis.grid(color='lightgray', linestyle='dashed')
plt.title("Trend in no. of missions of top 5 locations based on no. of missions")
plt.xlabel("Year")
plt.ylabel("No of launches")
plt.xticks(range(1957,2021))
plt.xticks(rotation=90)
plt.show()

<a id="9"></a><h2>What's the success rate per launch location?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>

In [None]:
piv = df[["Country","Status Mission"]].copy()
piv["Status Mission"] = piv["Status Mission"].replace(".* Failure","Failure",regex=True)
piv = pd.crosstab(columns=piv["Status Mission"],index=piv["Country"])

piv["Success Rate"] = 100*(piv["Success"] / (piv["Success"]+piv["Failure"]))
piv["Failure Rate"] = 100*(piv["Failure"] / (piv["Success"]+piv["Failure"]))
piv.sort_values(by="Success Rate",ascending=False,inplace=True)

piv.drop(columns=["Failure","Success"],inplace=True)

fig=plt.figure(figsize=(15,5))
ax=sns.heatmap(piv.T,annot=True,fmt="0.0f",cbar=False,cmap="gray",square=True,linewidths=0.1,linecolor="lightgray")
ax.xaxis.set_ticks_position('top')
plt.xticks(rotation=90)
plt.title("Launch location-wise success and failure rates")
plt.xlabel("Launch location")
plt.ylabel("")
plt.show()

All the missions from Kenya were successful, while all the missions launched from Brazil were failures. Let's see the status of all the missions from these locations.

In [None]:
aa=df.loc[(df["Country"]=="Kenya") | (df["Country"]=="Brazil"),["Country","Status Mission"]].reset_index(drop=True)
aa.columns=["Launch Location","Mission Status"]
aa.groupby(["Launch Location","Mission Status"]).size()

All 9 missions from Kenya were successes, while all 3 missions from Brazil were failures.

<a id="10"></a><h2>Is there any relationship between missing price and mission status?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>

In [None]:
pri = df[["Status Mission","Price"]].copy()
pri["Price Missing"] = pri["Price"].isna() #True where no price was recorded
pri.groupby(["Status Mission","Price Missing"]).size() #number of missions per status and missing-price flag

<ul>
<li>Failed missions have a lot of missing prices. Similarly, successful missions also have a very high number of missing prices.</li>
<li>Hence, whether a price is missing does not appear to be related to Mission Status.</li>
</ul>
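
Raw counts are hard to compare when one status has far more missions than another, so the share of missing prices within each status is a useful complement. The sketch below runs on made-up rows (the statuses and prices are hypothetical, not taken from the dataset) and takes the mean of a boolean flag per group, which gives the fraction of missing prices.

```python
import numpy as np
import pandas as pd

# Made-up rows: four successful and two failed missions, some without a price
toy = pd.DataFrame({
    "Status Mission": ["Success"] * 4 + ["Failure"] * 2,
    "Price": ["50.0", np.nan, np.nan, "62.0", np.nan, "29.75"],
})

# Mean of a True/False flag per group = share of missing prices in that group
missing_share = toy["Price"].isna().groupby(toy["Status Mission"]).mean()
print(missing_share)
```

Similar shares across statuses would support the reading that missing prices are unrelated to mission status; clearly different shares would call for a closer look.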

<a id="11"></a><h2>Is there any relationship between price and mission status?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>
<h3>Box plot to visualize bivariate data distribution</h3>

In [None]:
piv = df[["Price","Status Mission"]].copy()
piv["Status Mission"] = piv["Status Mission"].replace(".* Failure","Failure",regex=True)
piv.loc[piv["Price"]=="nan","Price"]=np.nan
piv["Price"] = piv["Price"].str.replace(",","",regex=False)
piv.dropna(inplace=True)
piv["Price"] = piv["Price"].astype(np.float32)
sns.boxplot(data=piv,x="Status Mission",y="Price")
plt.title("Price vs Mission status")
plt.xlabel("Mission status")
plt.ylabel("Price ($ million)")
plt.yscale('log')
plt.show()

From the above plot, we see that successful missions had a relatively higher price tag.
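
A numeric summary can back up what the box plot suggests. This is a minimal sketch with made-up prices (in $ million, hypothetical values); it cleans the price strings the same way as the cell above and compares the count and median per mission status.

```python
import pandas as pd

# Made-up price strings, including a thousands separator and stray spaces
toy = pd.DataFrame({
    "Status Mission": ["Success", "Success", "Failure", "Failure"],
    "Price": ["50.0 ", "1,160.0 ", "29.75 ", "7.5 "],
})

# Drop the thousands separator and surrounding spaces, then convert to float
toy["Price"] = toy["Price"].str.replace(",", "", regex=False).str.strip().astype(float)

# Median is less sensitive than the mean to a few very expensive missions
summary = toy.groupby("Status Mission")["Price"].agg(["count", "median"])
print(summary)
```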

<a id="13"></a><h2>How many missions are active?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>

In [None]:
fig,ax=plt.subplots(1,1)
sns.countplot(data=df,x='Status Rocket',ax=ax)
ax.set_title("Missions Active & Retired");

<a id="15"></a><h2>What is the trend in the price of mission over the years?&nbsp;&nbsp;&nbsp;&nbsp;<a href="#top">Top</a></h2>


In [None]:
price_year = df[['Year','Price']].copy()
price_year.dropna(inplace=True)
price_year["Price"] = price_year["Price"].str.replace(",","",regex=False).astype('float')
price_year = price_year.groupby('Year').median().copy()
price_year = price_year.reset_index().copy()

ax=price_year.plot(figsize=(15,6),x='Year',y='Price',marker="x",color="black")
ax.set_axisbelow(True)
ax.set_xticks(price_year["Year"].unique())
ax.yaxis.grid(color='lightgray', linestyle='dashed')
ax.xaxis.grid(color='lightgray', linestyle='dashed')
plt.title("Median Price of missions across years")
plt.ylabel("Median Price of Missions")
plt.xticks(rotation=90)
plt.show()

Prices of missions between 1974 and 1980 are missing. Mission prices in the early years were high.
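
Years without any price can be listed programmatically instead of being read off the chart. The sketch below uses made-up yearly prices (hypothetical values) and reindexes the yearly medians over the full year range, so that years with no data show up as NaN.

```python
import pandas as pd

# Made-up prices with a gap between 1973 and 1981
price_year = pd.DataFrame({
    "Year": [1972, 1973, 1981, 1982],
    "Price": [59.0, 63.0, 450.0, 30.8],
})

median_by_year = price_year.groupby("Year")["Price"].median()

# Reindex over every year in the range; years with no prices become NaN
full_years = list(range(int(median_by_year.index.min()), int(median_by_year.index.max()) + 1))
filled = median_by_year.reindex(full_years)

missing_years = filled[filled.isna()].index.tolist()
print(missing_years)
```

Plotting the reindexed series should also leave a visible gap in the line, since matplotlib does not connect points across NaN values.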