<img src="https://www.aex.ru/images/media/900/18171.jpg" width=600><br><br><br>
<h2 align="center">Predicting the success of the next SpaceX launch</h2>

## Part 1. Preparing the dataset

Import the libraries and take a first look at the data.

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

In [3]:
# Load the dataset
data = pd.read_csv('C:/Users/Vladimir/Coding/CSV_Datasets/For BI/SpaceMissions.csv', parse_dates=True)
data.head()

Unnamed: 0,Company,Launch Date,Launch Time,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Vehicle Type,Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Name,Payload Type,Payload Mass (kg),Payload Orbit,Mission Status,Failure Reason
0,SpaceX,24 March 2006,22:30,Marshall Islands,86.0,9.0,74.0,Falcon 1,343,470,22.25,1.5,FalconSAT-2,Research Satellite,19.5,Low Earth Orbit,Failure,Engine Fire During Launch
1,SpaceX,21 March 2007,1:10,Marshall Islands,,,,Falcon 1,343,470,22.25,1.5,DemoSat,Mass simulator,,Low Earth Orbit,Failure,Engine Shutdown During Launch
2,SpaceX,3 August 2008,3:34,Marshall Islands,,,,Falcon 1,343,470,22.25,1.5,Trailblazer,Communication Satellite,,Low Earth Orbit,Failure,Collision During Launch
3,SpaceX,3 August 2008,3:34,Marshall Islands,,,,Falcon 1,343,470,22.25,1.5,"PRESat, NanoSail-D",Research Satellites,8.0,Low Earth Orbit,Failure,Collision During Launch
4,SpaceX,3 August 2008,3:34,Marshall Islands,,,,Falcon 1,343,470,22.25,1.5,Explorers,Human Remains,,Low Earth Orbit,Failure,Collision During Launch


In [4]:
# Check the dimensions
data.shape

(150, 18)

In [5]:
# Check the column types. The Non-Null Count column immediately shows
# that some columns have fewer than 150 values, i.e. they contain NaNs
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Company                150 non-null    object 
 1   Launch Date            150 non-null    object 
 2   Launch Time            146 non-null    object 
 3   Launch Site            150 non-null    object 
 4   Temperature (° F)      136 non-null    float64
 5   Wind speed (MPH)       136 non-null    float64
 6   Humidity (%)           136 non-null    float64
 7   Vehicle Type           150 non-null    object 
 8   Liftoff Thrust (kN)    150 non-null    int64  
 9   Payload to Orbit (kg)  150 non-null    int64  
 10  Rocket Height (m)      150 non-null    float64
 11  Fairing Diameter (m)   146 non-null    float64
 12  Payload Name           150 non-null    object 
 13  Payload Type           148 non-null    object 
 14  Payload Mass (kg)      133 non-null    object 
 15  Payloa

In [6]:
# Count the NaNs per column
data.isnull().sum()

Company                    0
Launch Date                0
Launch Time                4
Launch Site                0
Temperature (° F)         14
Wind speed (MPH)          14
Humidity (%)              14
Vehicle Type               0
Liftoff Thrust (kN)        0
Payload to Orbit (kg)      0
Rocket Height (m)          0
Fairing Diameter (m)       4
Payload Name               0
Payload Type               2
Payload Mass (kg)         17
Payload Orbit              0
Mission Status             0
Failure Reason           121
dtype: int64

Now for the cleaning. We need to get rid of the NaNs, and we will fill them using the SimpleImputer class from sklearn.impute.
We start with the first affected field, Launch Time, and work our way down the list.

In [7]:
# Clean Launch Time using the most_frequent strategy
x = data.iloc[:, 2].values
x = x.reshape(-1,1) 
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(x)
x = imputer.transform(x)
data.iloc[:, 2] = x 

In [8]:
# Смотрим на результаты
data['Launch Time'].describe()

count       150
unique      132
top       14:29
freq          7
Name: Launch Time, dtype: object

In [9]:
# Fill the Temperature (° F) field, this time with the mean value
x = data.iloc[:, 4].values
x = x.reshape(-1,1)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x)
x = imputer.transform(x)
data.iloc[:, 4] = x
data.isnull().sum()

Company                    0
Launch Date                0
Launch Time                0
Launch Site                0
Temperature (° F)          0
Wind speed (MPH)          14
Humidity (%)              14
Vehicle Type               0
Liftoff Thrust (kN)        0
Payload to Orbit (kg)      0
Rocket Height (m)          0
Fairing Diameter (m)       4
Payload Name               0
Payload Type               2
Payload Mass (kg)         17
Payload Orbit              0
Mission Status             0
Failure Reason           121
dtype: int64

In [10]:
# Wind speed (MPH), also filled with the mean
x = data.iloc[:, 5].values
x = x.reshape(-1,1)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x)
x = imputer.transform(x)
data.iloc[:, 5] = x
data.isnull().sum()

Company                    0
Launch Date                0
Launch Time                0
Launch Site                0
Temperature (° F)          0
Wind speed (MPH)           0
Humidity (%)              14
Vehicle Type               0
Liftoff Thrust (kN)        0
Payload to Orbit (kg)      0
Rocket Height (m)          0
Fairing Diameter (m)       4
Payload Name               0
Payload Type               2
Payload Mass (kg)         17
Payload Orbit              0
Mission Status             0
Failure Reason           121
dtype: int64

In [11]:
# Humidity (%), also filled with the mean
x = data.iloc[:, 6].values
x = x.reshape(-1,1)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x)
x = imputer.transform(x)
data.iloc[:, 6] = x
data.isnull().sum()

Company                    0
Launch Date                0
Launch Time                0
Launch Site                0
Temperature (° F)          0
Wind speed (MPH)           0
Humidity (%)               0
Vehicle Type               0
Liftoff Thrust (kN)        0
Payload to Orbit (kg)      0
Rocket Height (m)          0
Fairing Diameter (m)       4
Payload Name               0
Payload Type               2
Payload Mass (kg)         17
Payload Orbit              0
Mission Status             0
Failure Reason           121
dtype: int64
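
The three mean imputations above repeat the same steps for each column. As a compact alternative, a single `SimpleImputer` can process several numeric columns in one call. The sketch below is only an illustration on a made-up toy frame (the values and the name `toy` are assumptions, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the weather columns (values are made up for illustration)
toy = pd.DataFrame({
    'Temperature (° F)': [86.0, np.nan, 78.0],
    'Wind speed (MPH)': [9.0, 7.0, np.nan],
    'Humidity (%)': [74.0, np.nan, 88.0],
})

weather_cols = list(toy.columns)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform computes one mean per column and fills each column's gaps with it
toy[weather_cols] = imputer.fit_transform(toy[weather_cols])
print(toy)
```

Each column is still filled with its own mean, so the result should match the column-by-column version.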

Now for a harder task. So far we have filled NaNs with the mean taken over the whole range of values.
The next field in order is Fairing Diameter (m). We cannot apply the same strategy here, because each vendor has its own fairing diameters. We need the mean computed over each vendor's own values.
Let's look at the summary statistics of the 'Fairing Diameter (m)' field.

In [12]:
data['Fairing Diameter (m)'].describe() # 150 - 146 = 4 NaN values

count    146.000000
mean       4.252329
std        1.308869
min        1.000000
25%        3.000000
50%        5.200000
75%        5.200000
max        5.400000
Name: Fairing Diameter (m), dtype: float64

In [13]:
# Show the rows where this field is NaN
data[data["Fairing Diameter (m)"].isnull()]

Unnamed: 0,Company,Launch Date,Launch Time,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Vehicle Type,Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Name,Payload Type,Payload Mass (kg),Payload Orbit,Mission Status,Failure Reason
21,European Space Agency,23 May 1980,14:29,Guiana Space Centre,70.441176,8.007353,76.066176,Ariane 1,2772,1850,50.0,,CAT-2 / Amsat P3A / Firewheel Subsat,Communication Satellite,,Geostationary Transfer Orbit,Failure,Combustion instability that had occurred in on...
22,European Space Agency,9 September 1982,2:12,Guiana Space Centre,70.441176,8.007353,76.066176,Ariane 1,2772,1850,50.0,,MARCES-B / Sirio-2,Communication Satellite,1710.0,Geostationary Transfer Orbit,Failure,Third stage turbopump malfunction.
28,Arianespace,4 June 1996,12:34,Guiana Space Centre,70.441176,8.007353,76.066176,Ariane 5 G,11400,6900,52.0,,Cluster,European Space Agency spacecraft,,High Earth Orbit,Failure,The Ariane 5 program's first launch failed bec...
29,Arianespace,1 December 1994,22:57,Guiana Space Centre,70.441176,8.007353,76.066176,Ariane 42P,4334,2930,58.72,,Panamsat-3,Communication Satellite,2920.0,Geostationary Transfer Orbit,Failure,


This means we need the mean fairing diameter for two vendors: European Space Agency and Arianespace.

In [14]:
# The first vendor has 2 known values and 2 missing ones
fm_arienaspace = data[data['Company'] == 'Arianespace']
list_1 = fm_arienaspace['Fairing Diameter (m)']
list_1

26    2.6
27    5.4
28    NaN
29    NaN
Name: Fairing Diameter (m), dtype: float64

In [15]:
# For the second vendor both values are missing, so all that is left is to fill them with the Arianespace values
fm_arienaspace = data[data['Company'] == 'European Space Agency']
list_1 = fm_arienaspace['Fairing Diameter (m)']
list_1

21   NaN
22   NaN
Name: Fairing Diameter (m), dtype: float64

In [16]:
# Look at the set of values of Fairing Diameter
data['Fairing Diameter (m)'].unique()

array([1.5 , 5.2 , 2.9 , 4.  , 5.1 , 3.  , 1.52,  nan, 1.  , 2.6 , 5.4 ])

In [17]:
# Fill the NaNs in 'Fairing Diameter (m)' with the Arianespace mean
fm_arienaspace = data[data['Company'] == 'Arianespace']
list_1 = fm_arienaspace['Fairing Diameter (m)']

x = np.mean(list_1)

# Fill only the rows of the two vendors that have gaps
# (the original loop called fillna on the whole column, ignoring the vendor)
mask = data['Company'].isin(['Arianespace', 'European Space Agency'])
data.loc[mask, 'Fairing Diameter (m)'] = data.loc[mask, 'Fairing Diameter (m)'].fillna(x)
data['Fairing Diameter (m)'].unique()

array([1.5 , 5.2 , 2.9 , 4.  , 5.1 , 3.  , 1.52, 1.  , 2.6 , 5.4 ])
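
The per-vendor fill described above can also be written with `groupby`, which scales to more than two vendors. This is a sketch on a made-up toy frame, not the notebook's data. The second step, borrowing the Arianespace mean for a vendor with no known values, mirrors the choice made above and is an assumption rather than a general rule:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'Company': ['Arianespace', 'Arianespace', 'Arianespace',
                'European Space Agency', 'SpaceX'],
    'Fairing Diameter (m)': [2.6, 5.4, np.nan, np.nan, 5.2],
})
col = 'Fairing Diameter (m)'

# Step 1: fill each gap with the mean of the same vendor, where one exists
vendor_mean = toy.groupby('Company')[col].transform('mean')
toy[col] = toy[col].fillna(vendor_mean)

# Step 2: a vendor with no known value at all (European Space Agency here)
# borrows the Arianespace mean
fallback = toy.loc[toy['Company'] == 'Arianespace', col].mean()
toy[col] = toy[col].fillna(fallback)
print(toy)
```

Rows of other vendors, such as SpaceX here, are never touched by the fallback because they have no gaps left after step 1.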

In [18]:
# Look at the NaNs in Payload Type. Both belong to SpaceX
data[data['Payload Type'].isnull()]

Unnamed: 0,Company,Launch Date,Launch Time,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Vehicle Type,Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Name,Payload Type,Payload Mass (kg),Payload Orbit,Mission Status,Failure Reason
30,SpaceX,28 September 2008,23:15,Marshall Islands,70.441176,8.007353,76.066176,Falcon 1,343,470,22.25,1.5,RatSat (DemoSat),,165.0,Low Earth Orbit,Success,
32,SpaceX,4 June 2010,18:45,Cape Canaveral,78.0,7.0,88.0,Falcon 9 (v1.0),4940,10450,54.9,5.2,Dragon Spacecraft Qualification Unit,,,Low Earth Orbit,Success,


In [19]:
# What is in this column anyway?
pld_spacex = data[data['Company'] == 'SpaceX']
pld_spacex[pld_spacex['Payload Type'].notnull()]
pld_spacex['Payload Type'].unique()

array(['Research Satellite', 'Mass simulator', 'Communication Satellite',
       'Research Satellites', 'Human Remains', 'Space Station Supplies',
       nan, 'Weather Satellite', 'Communication/Research Satellite',
       'Classified', 'high-speed mobile broadband service',
       'Earth observation satellite', 'reusable uncrewed spacecraft',
       'Direct-to-Home (DTH) broadcast, broadband, and backhaul services',
       'Governmental and institutional security user needs satellite',
       'Tesla carrying mannequin Starman – sporting SpaceX spacesuit',
       "Spain's spy satellite / internet satellite", 'Star monitor',
       'Communication Satellite / Research satellite',
       'Direct-to-Home Television Services', 'GPS III satellites',
       'Communication Satellite / Moon lander / Reseach Satellite',
       'Uncrewed Test Commercial Crew program',
       'Earth observation satellites', 'in-flight abort test'],
      dtype=object)

In [20]:
# Look at the Payload Type statistics for SpaceX only
pld_spacex = data[data['Company'] == 'SpaceX']
pld_spacex['Payload Type'].describe()

count                          94
unique                         24
top       Communication Satellite
freq                           45
Name: Payload Type, dtype: object

The most frequent value is Communication Satellite, so we fill the NaNs with it.

In [21]:
# Fill Payload Type
data['Payload Type'] = data['Payload Type'].fillna('Communication Satellite')
data.isnull().sum()

Company                    0
Launch Date                0
Launch Time                0
Launch Site                0
Temperature (° F)          0
Wind speed (MPH)           0
Humidity (%)               0
Vehicle Type               0
Liftoff Thrust (kN)        0
Payload to Orbit (kg)      0
Rocket Height (m)          0
Fairing Diameter (m)       0
Payload Name               0
Payload Type               0
Payload Mass (kg)         17
Payload Orbit              0
Mission Status             0
Failure Reason           121
dtype: int64

In [22]:
# Check which payloads have no Payload Mass (kg) value
a = data[data['Payload Mass (kg)'].isnull()]
a['Payload Name'].unique()

array(['DemoSat', 'Trailblazer', 'Explorers', 'Orion 3',
       'DemoSat / 3CS-1 & 2', 'IDCSP GGTS-2', 'LCS-2',
       'CAT-2 / Amsat P3A / Firewheel Subsat', 'Cluster',
       'Dragon Spacecraft Qualification Unit', 'SpaceX CRS (Dragon C1)',
       'SpaceX CRS (Dragon C2+)', 'OG2 Mission 1 (6 OG2 Satellites)',
       'OG2 Mission 2 (11 OG2 Satellites)', 'STP-2', 'MS-11', 'BONUM-1'],
      dtype=object)

In [23]:
# And what is in there anyway?
data['Payload Mass (kg)'].unique()

array(['19.5', nan, '8', '150', '1952', '5500', '711', '2030', '700',
       '2750', '3000', '0', '1710', '115', '80', '74', '1197', '5480',
       '2920', '165', '180', '500', '677', '3170', '3325', '2296', '4535',
       '4428', '2216', '2395', '570', '4159', '1898', '4707', '553',
       '5271', '3136', '4696', '3100', '3600', '2257', '4600', '9600',
       '2490', '5600', '5300', 'Classified', '6070', '2708', '3669',
       '6761', '3310', '475', '4990', '5200', '3500', '2205', '4230',
       '1300', '2141', '6092', '2647', '362', '5960', '5400', '2700',
       '7080', '5800', '7060', '2573', '3880', '5380', '6000', '2500',
       '13620', '4200', '6500', '15600', '6956', '6350', '338', '290',
       '305', '2450', '2200', '550', '1360', '2032', '210', '4348', '673',
       '758', '1800', '835', '636', '1100', '660', '689', '3117', '328',
       '970'], dtype=object)

A text value, 'Classified', has slipped in among the numeric values.
Let's clean it out.

In [24]:
# Filter the rows and assign the value 0
data.loc[ data['Payload Mass (kg)'] == 'Classified', 'Payload Mass (kg)'] = 0 
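
Setting 'Classified' to 0 means the zero later enters the mean as if it were a real mass. One possible alternative, shown here only as a sketch on a made-up series, is to let `pd.to_numeric` turn every non-numeric entry into NaN so that it is imputed together with the other gaps:

```python
import pandas as pd

# Toy series mixing numeric strings, a gap and a text marker (illustrative values)
mass = pd.Series(['19.5', None, '8', 'Classified', '5500'], name='Payload Mass (kg)')

# Anything that cannot be parsed as a number becomes NaN
mass_num = pd.to_numeric(mass, errors='coerce')
print(mass_num)
```

Which treatment is better depends on how the classified payloads should be interpreted, and the notebook keeps the zero.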

In [25]:
# The text is gone; the NaNs are next
data['Payload Mass (kg)'].unique()

array(['19.5', nan, '8', '150', '1952', '5500', '711', '2030', '700',
       '2750', '3000', '0', '1710', '115', '80', '74', '1197', '5480',
       '2920', '165', '180', '500', '677', '3170', '3325', '2296', '4535',
       '4428', '2216', '2395', '570', '4159', '1898', '4707', '553',
       '5271', '3136', '4696', '3100', '3600', '2257', '4600', '9600',
       '2490', '5600', '5300', 0, '6070', '2708', '3669', '6761', '3310',
       '475', '4990', '5200', '3500', '2205', '4230', '1300', '2141',
       '6092', '2647', '362', '5960', '5400', '2700', '7080', '5800',
       '7060', '2573', '3880', '5380', '6000', '2500', '13620', '4200',
       '6500', '15600', '6956', '6350', '338', '290', '305', '2450',
       '2200', '550', '1360', '2032', '210', '4348', '673', '758', '1800',
       '835', '636', '1100', '660', '689', '3117', '328', '970'],
      dtype=object)

In [26]:
# Convert the values to float, since the column holds fractional values such as 19.5
data['Payload Mass (kg)'] = data['Payload Mass (kg)'].astype(float)
data['Payload Mass (kg)'].unique()

array([1.950e+01,       nan, 8.000e+00, 1.500e+02, 1.952e+03, 5.500e+03,
       7.110e+02, 2.030e+03, 7.000e+02, 2.750e+03, 3.000e+03, 0.000e+00,
       1.710e+03, 1.150e+02, 8.000e+01, 7.400e+01, 1.197e+03, 5.480e+03,
       2.920e+03, 1.650e+02, 1.800e+02, 5.000e+02, 6.770e+02, 3.170e+03,
       3.325e+03, 2.296e+03, 4.535e+03, 4.428e+03, 2.216e+03, 2.395e+03,
       5.700e+02, 4.159e+03, 1.898e+03, 4.707e+03, 5.530e+02, 5.271e+03,
       3.136e+03, 4.696e+03, 3.100e+03, 3.600e+03, 2.257e+03, 4.600e+03,
       9.600e+03, 2.490e+03, 5.600e+03, 5.300e+03, 6.070e+03, 2.708e+03,
       3.669e+03, 6.761e+03, 3.310e+03, 4.750e+02, 4.990e+03, 5.200e+03,
       3.500e+03, 2.205e+03, 4.230e+03, 1.300e+03, 2.141e+03, 6.092e+03,
       2.647e+03, 3.620e+02, 5.960e+03, 5.400e+03, 2.700e+03, 7.080e+03,
       5.800e+03, 7.060e+03, 2.573e+03, 3.880e+03, 5.380e+03, 6.000e+03,
       2.500e+03, 1.362e+04, 4.200e+03, 6.500e+03, 1.560e+04, 6.956e+03,
       6.350e+03, 3.380e+02, 2.900e+02, 3.050e+02, 

In [27]:
# Fill the NaNs in Payload Mass (kg) with the mean value
x = data.iloc[:, -4].values
x = x.reshape(-1,1)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x)
x = imputer.transform(x)
data.iloc[:, -4] = x

data.isnull().sum()

Company                    0
Launch Date                0
Launch Time                0
Launch Site                0
Temperature (° F)          0
Wind speed (MPH)           0
Humidity (%)               0
Vehicle Type               0
Liftoff Thrust (kN)        0
Payload to Orbit (kg)      0
Rocket Height (m)          0
Fairing Diameter (m)       0
Payload Name               0
Payload Type               0
Payload Mass (kg)          0
Payload Orbit              0
Mission Status             0
Failure Reason           121
dtype: int64

In [29]:
# Convert to int (fractional parts are truncated, so 19.5 becomes 19)
data['Payload Mass (kg)'] = data['Payload Mass (kg)'].astype(int)
data['Payload Mass (kg)'].unique()

array([   19,  3443,     8,   150,  1952,  5500,   711,  2030,   700,
        2750,  3000,     0,  1710,   115,    80,    74,  1197,  5480,
        2920,   165,   180,   500,   677,  3170,  3325,  2296,  4535,
        4428,  2216,  2395,   570,  4159,  1898,  4707,   553,  5271,
        3136,  4696,  3100,  3600,  2257,  4600,  9600,  2490,  5600,
        5300,  6070,  2708,  3669,  6761,  3310,   475,  4990,  5200,
        3500,  2205,  4230,  1300,  2141,  6092,  2647,   362,  5960,
        5400,  2700,  7080,  5800,  7060,  2573,  3880,  5380,  6000,
        2500, 13620,  4200,  6500, 15600,  6956,  6350,   338,   290,
         305,  2450,  2200,   550,  1360,  2032,   210,  4348,   673,
         758,  1800,   835,   636,  1100,   660,   689,  3117,   328,
         970])

Our goal is to prepare the dataset for ML modelling. This means that all categorical values must be
encoded as numbers. That is what we do next.

In [30]:
# Company column: look at its values
data['Company'].unique()

array(['SpaceX', 'Boeing', 'Martin Marietta', 'US Air Force',
       'European Space Agency', 'Brazilian Space Agency', 'Arianespace'],
      dtype=object)

In [31]:
# Encode the Company values as integers
label = LabelEncoder()
dicts = {}
label.fit(data['Company'].drop_duplicates()) 
data['Company'] = label.transform(data['Company'])

In [32]:
# Company: check the result of the encoding
data['Company'].unique()

array([5, 1, 4, 6, 3, 2, 0])
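
The integer codes are hard to read on their own. `LabelEncoder` sorts the labels and uses each label's position as its code, so the mapping can be recovered from `classes_`. The sketch below refits a separate encoder on the company names listed above (the names `encoder` and `mapping` are illustrative). Since the notebook reuses one encoder object for later columns, `classes_` has to be read before the next `fit` call:

```python
from sklearn.preprocessing import LabelEncoder

companies = ['SpaceX', 'Boeing', 'Martin Marietta', 'US Air Force',
             'European Space Agency', 'Brazilian Space Agency', 'Arianespace']

encoder = LabelEncoder()
codes = encoder.fit_transform(companies)

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
decoded = encoder.inverse_transform(codes)
```

Keep in mind that these codes imply an ordering (Arianespace < Boeing < ...) that has no physical meaning, which distance-based models may pick up on.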

In [62]:
# Launch Site: look at its values
data['Launch Site'].unique()

array(['Marshall Islands', 'Cape Canaveral', 'Vandenberg ',
       ' Guiana Space Centre', 'Alcântara Launch Center',
       'Kennedy Space Center'], dtype=object)

In [33]:
# Launch Site: encode it the same way
label.fit(data['Launch Site'].drop_duplicates()) 
data['Launch Site'] = label.transform(data['Launch Site']) 
data['Launch Site'].unique()

array([4, 2, 5, 0, 1, 3])
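
The Launch Site values printed above include stray spaces ('Vandenberg ' and ' Guiana Space Centre'). That is harmless as long as each site is always spelled the same way, but two spellings of one site would receive two different codes. A minimal sketch of stripping the whitespace before encoding, on a made-up series in which the second 'Vandenberg' spelling is hypothetical:

```python
import pandas as pd

# 'Vandenberg' without the trailing space is a hypothetical second spelling
sites = pd.Series(['Vandenberg ', ' Guiana Space Centre', 'Vandenberg', 'Cape Canaveral'])

# Remove leading and trailing whitespace so identical sites compare equal
clean = sites.str.strip()
print(sites.nunique(), clean.nunique())
```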

In [34]:
# Rename Vehicle Type. The space in the name would get in our way in the next step
data = data.rename({'Vehicle Type': "Vehicle_Type"}, axis = 1) 
data['Vehicle_Type'].unique()

array(['Falcon 1', 'Falcon 9 (v1.0)', 'Falcon 9 (v1.1)',
       'Falcon 9 Full Thrust (v1.2)', 'Delta II 7925', 'Delta III 8930',
       'Delta IV Heavy', 'Titan II(23)G', 'Titan IIIC', 'Titan III(24)B',
       'Titan IIIB', 'Titan IIIA', 'Ariane 1', 'VLS-1', 'Vega',
       'Ariane 5 ECA', 'Ariane 5 G', 'Ariane 42P', 'Falcon 9 Block 3',
       'Falcon 9 Block 4', 'Falcon Heavy', 'Falcon 9 Block 5',
       'Delta II 7920-10C', 'Delta II 7425', 'Delta II 7426',
       'Delta II 7920-10', 'Delta II 7420-10C', 'Delta II 7320-10C',
       'Delta II 7326', 'Delta II 7425-10C', 'Delta IV Medium+ (4,2)'],
      dtype=object)

Since there are many launch vehicle names, we group them by brand.

In [35]:
data['vehicle_type'] = data.Vehicle_Type.str.extract('([A-Za-z]+)', expand=False)
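
To see what the regular expression does, here it is applied to a few of the vehicle names listed above. The pattern `([A-Za-z]+)` captures the first run of Latin letters, which is the brand name in each of these strings:

```python
import pandas as pd

names = pd.Series(['Falcon 9 (v1.0)', 'Delta II 7925', 'Titan II(23)G',
                   'Ariane 42P', 'VLS-1', 'Vega'])

# expand=False returns a Series holding the first captured group
brands = names.str.extract('([A-Za-z]+)', expand=False)
print(brands.tolist())
# ['Falcon', 'Delta', 'Titan', 'Ariane', 'VLS', 'Vega']
```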

In [36]:
# Encode vehicle_type
titles = {"Falcon": 0, "Delta": 1, "Titan": 2, "Ariane": 3, "Vega": 4, "VLS": 5} 
data['vehicle_type'] = data.Vehicle_Type.str.extract('([A-Za-z]+)', expand=False) 
data['vehicle_type'] = data['vehicle_type'].map(titles)  
data = data.drop(['Vehicle_Type'], axis=1)
data.head()

Unnamed: 0,Company,Launch Date,Launch Time,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Name,Payload Type,Payload Mass (kg),Payload Orbit,Mission Status,Failure Reason,vehicle_type
0,5,24 March 2006,22:30,4,86.0,9.0,74.0,343,470,22.25,1.5,FalconSAT-2,Research Satellite,19,Low Earth Orbit,Failure,Engine Fire During Launch,0
1,5,21 March 2007,1:10,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,DemoSat,Mass simulator,3443,Low Earth Orbit,Failure,Engine Shutdown During Launch,0
2,5,3 August 2008,3:34,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,Trailblazer,Communication Satellite,3443,Low Earth Orbit,Failure,Collision During Launch,0
3,5,3 August 2008,3:34,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,"PRESat, NanoSail-D",Research Satellites,8,Low Earth Orbit,Failure,Collision During Launch,0
4,5,3 August 2008,3:34,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,Explorers,Human Remains,3443,Low Earth Orbit,Failure,Collision During Launch,0


In [37]:
# Drop the failure reason, since this field is purely descriptive
data = data.drop(['Failure Reason'], axis = 1)

In [38]:
# Payload Name and Payload Type
# These fields have too many unique values to encode sensibly, so we drop them too
data = data.drop(['Payload Name', 'Payload Type'], axis = 1)

In [39]:
#Payload Orbit
data['Payload Orbit'].unique()

array(['Low Earth Orbit', 'Geostationary Transfer Orbit',
       'Medium Earth Orbit', 'Sun-Synchronous Orbit', 'Polar Orbit',
       'High Earth Orbit', 'Sun/Earth Orbit', 'Heliocentric Orbit',
       'Suborbital', 'Mars Orbit', 'Earth-Moon L2'], dtype=object)

In [40]:
# Encode Payload Orbit
label.fit(data['Payload Orbit'].drop_duplicates())
data['Payload Orbit'] = label.transform(data['Payload Orbit'])
data['Payload Orbit'].unique()

array([ 4,  1,  6,  9,  7,  3, 10,  2,  8,  5,  0])

In [41]:
# Mission Status can take only 2 values
data.loc[data['Mission Status'] == 'Failure', 'Mission Status'] = 0
data.loc[data['Mission Status'] == 'Success', 'Mission Status'] = 1

In [42]:
# Look at the intermediate dataframe
data.head()

Unnamed: 0,Company,Launch Date,Launch Time,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Mass (kg),Payload Orbit,Mission Status,vehicle_type
0,5,24 March 2006,22:30,4,86.0,9.0,74.0,343,470,22.25,1.5,19,4,0,0
1,5,21 March 2007,1:10,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0
2,5,3 August 2008,3:34,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0
3,5,3 August 2008,3:34,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,8,4,0,0
4,5,3 August 2008,3:34,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0


Now let's deal with the dates.

In [43]:
# Combine the launch date and time into a single field, casting both to strings first
data['Launch Date'] = data['Launch Date'].astype(str)
data['Launch Time'] = data['Launch Time'].astype(str)
data['Launch_Time'] = data['Launch Date'].str.cat(data['Launch Time'],sep=" ")
data = data.drop(['Launch Time', 'Launch Date'], axis = 1)
data.head()

Unnamed: 0,Company,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Mass (kg),Payload Orbit,Mission Status,vehicle_type,Launch_Time
0,5,4,86.0,9.0,74.0,343,470,22.25,1.5,19,4,0,0,24 March 2006 22:30
1,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0,21 March 2007 1:10
2,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0,3 August 2008 3:34
3,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,8,4,0,0,3 August 2008 3:34
4,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0,3 August 2008 3:34


In [44]:
# Convert the new column to datetime objects
from datetime import datetime
data['Launch_Time'] = data['Launch_Time'].map(lambda x: datetime.strptime(x, '%d %B %Y %H:%M'))
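
The same conversion can be done in a vectorized way with `pd.to_datetime`. The sketch below uses three timestamps taken from the table above and assumes an English locale for the month names. The `year` and `hour` columns at the end are optional illustrations of numeric features that could be derived from the timestamp, not something the notebook uses:

```python
import pandas as pd

stamps = pd.Series(['24 March 2006 22:30', '21 March 2007 1:10', '3 August 2008 3:34'])

# Same format string as in the strptime call above
parsed = pd.to_datetime(stamps, format='%d %B %Y %H:%M')

# Optional: numeric parts of the timestamp that a model could consume
year = parsed.dt.year
hour = parsed.dt.hour
print(parsed)
```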

In [45]:
data['Mission Status'] = pd.to_numeric(data['Mission Status'])

The final dataframe

In [46]:
data.head()

Unnamed: 0,Company,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Mass (kg),Payload Orbit,Mission Status,vehicle_type,Launch_Time
0,5,4,86.0,9.0,74.0,343,470,22.25,1.5,19,4,0,0,2006-03-24 22:30:00
1,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0,2007-03-21 01:10:00
2,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0,2008-08-03 03:34:00
3,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,8,4,0,0,2008-08-03 03:34:00
4,5,4,70.441176,8.007353,76.066176,343,470,22.25,1.5,3443,4,0,0,2008-08-03 03:34:00


In [47]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Company                150 non-null    int32         
 1   Launch Site            150 non-null    int32         
 2   Temperature (° F)      150 non-null    float64       
 3   Wind speed (MPH)       150 non-null    float64       
 4   Humidity (%)           150 non-null    float64       
 5   Liftoff Thrust (kN)    150 non-null    int64         
 6   Payload to Orbit (kg)  150 non-null    int64         
 7   Rocket Height (m)      150 non-null    float64       
 8   Fairing Diameter (m)   150 non-null    float64       
 9   Payload Mass (kg)      150 non-null    int32         
 10  Payload Orbit          150 non-null    int32         
 11  Mission Status         150 non-null    int64         
 12  vehicle_type           150 non-null    int64         
 13  Launc

Above we can see that the Non-Null Count column shows 150 for every column, which equals the size of the dataset. This means we have no
empty (NaN) values left.

## Part 2. Running ML!

In [48]:
# Look at the columns of our dataframe
data.sample()

Unnamed: 0,Company,Launch Site,Temperature (° F),Wind speed (MPH),Humidity (%),Liftoff Thrust (kN),Payload to Orbit (kg),Rocket Height (m),Fairing Diameter (m),Payload Mass (kg),Payload Orbit,Mission Status,vehicle_type,Launch_Time
138,1,2,68.0,15.0,83.0,3511,1819,38.1,2.9,2032,6,1,1,2001-01-30 07:55:00


In [49]:
# Remove the dependent variable (Mission Status) from train and move it to target
# (Launch_Time, the last column, is left out of train as well)
train = data.iloc[:, [0,1,2,3,4,5,6,7,8,9,10,12]]
target = data.iloc[:, -3].values
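
Selecting columns by position breaks silently if the column order changes. A sketch of the same split done by column name, on a made-up toy frame with a subset of the notebook's columns (the name `features` is illustrative):

```python
import pandas as pd

# Toy frame with a few of the final columns (illustrative values)
toy = pd.DataFrame({
    'Company': [5, 1],
    'Payload Orbit': [4, 6],
    'Mission Status': [0, 1],
    'vehicle_type': [0, 1],
    'Launch_Time': pd.to_datetime(['2006-03-24 22:30', '2001-01-30 07:55']),
})

# Everything except the target and the raw timestamp becomes a feature
features = toy.drop(columns=['Mission Status', 'Launch_Time'])
target = toy['Mission Status'].values
print(list(features.columns))
```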

In [50]:
# Import the ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
import statsmodels.api as sm

In [75]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(train,target, test_size = 0.2, random_state = 0)

In [52]:
# Split once more, this time the full dataset (to compare the model's results later);
# we will only need full_test
full_train, full_test = train_test_split(data, test_size = 0.2, random_state = 0)
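
The second split is useful only because it selects the same rows as the first one. With the same `random_state`, the same `test_size` and the same number of rows, `train_test_split` produces the same shuffle. The sketch below checks this on a made-up toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, one feature and a binary target (illustrative values)
toy = pd.DataFrame({'feature': np.arange(20), 'Mission Status': np.arange(20) % 2})
X = toy[['feature']]
y = toy['Mission Status'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
full_train, full_test = train_test_split(toy, test_size=0.2, random_state=0)

# Both calls pick the same rows in the same order
same_rows = X_test.index.equals(full_test.index)
print(same_rows)
```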

In [53]:
# Standardize our samples to suit the scikit-learn models (mean = 0 and std = 1)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test) 
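
A small sketch of what the scaler does, on made-up numbers. The scaler learns the mean and standard deviation from the training rows only and then reuses them for the test rows, so the test set is not guaranteed to have zero mean. Note also that the result is a NumPy array, so the pandas index of the input is lost:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from the training rows
test_scaled = sc.transform(test)        # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```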

We write a function for each ML model and test it right away.

In [64]:
# Function for Logistic Regression
def LogReg(x_train, y_train, x_test, y_test):
    regressor =  LogisticRegression(random_state=0, solver='lbfgs')
    regressor.fit(x_train, y_train)
    y_pred = regressor.predict(x_test)
    cm = metrics.confusion_matrix(y_test, y_pred)
    print('Logistic Regression metrics:')
    print('R-squared:', metrics.r2_score(y_test, y_pred))
    print('Confusion matrix:\n', cm)
    print('Model accuracy: ', metrics.accuracy_score(y_test, y_pred))
    return y_pred

In [72]:
result = LogReg(x_train,y_train,x_test, y_test)
result

Logistic Regression metrics:
R-squared: 0.5833333333333334
Confusion matrix:
 [[ 4  2]
 [ 0 24]]
Model accuracy:  0.9333333333333333


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0], dtype=int64)
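
The printed metrics can be reproduced from the confusion matrix alone, which helps to read them. The sketch below rebuilds label vectors that give the matrix `[[4, 2], [0, 24]]`; the order of the rows is an assumption, and it does not affect these metrics. It also computes two numbers the notebook does not print: the accuracy of always predicting the majority class, and the recall for the Failure class. R-squared is included only because the notebook reports it; for a binary target, accuracy and per-class recall are the more usual choices.

```python
import numpy as np
from sklearn import metrics

# 6 actual failures (4 caught, 2 missed) and 24 actual successes (all caught)
y_true = np.array([0] * 6 + [1] * 24)
y_pred = np.array([0] * 4 + [1] * 2 + [1] * 24)

cm = metrics.confusion_matrix(y_true, y_pred)
accuracy = metrics.accuracy_score(y_true, y_pred)   # 28 of 30 correct
r2 = metrics.r2_score(y_true, y_pred)               # 1 - 2 / 4.8

# Accuracy of a trivial model that always predicts the majority class
baseline = max(np.mean(y_true), 1 - np.mean(y_true))
# Share of actual failures that the model identified
failure_recall = metrics.recall_score(y_true, y_pred, pos_label=0)

print(cm)
print(accuracy, r2, baseline, failure_recall)
```

With 24 successes among the 30 test launches, the trivial baseline already reaches 0.8, so the 0.93 accuracy is best read against that figure.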

In [66]:
# Function for Random Forest Classifier
def RndFst(x_train,y_train,x_test,y_test):
    classifier = RandomForestClassifier(n_estimators=100)
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    cm = metrics.confusion_matrix(y_test, y_pred)
    print('Random Forest Classifier metrics:')
    print('R-squared:', metrics.r2_score(y_test, y_pred))
    print('Confusion matrix:\n', cm)
    print('Model accuracy: ', metrics.accuracy_score(y_test, y_pred))
    return y_pred

In [67]:
result = RndFst(x_train,y_train,x_test,y_test)
result

Random Forest Classifier metrics:
R-squared: 0.5833333333333334
Confusion matrix:
 [[ 4  2]
 [ 0 24]]
Model accuracy:  0.9333333333333333


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0], dtype=int64)

In [68]:
# Function for k-NN
def KNghb(x_train,y_train,x_test,y_test):
    # The Minkowski metric with p=2 is the ordinary Euclidean distance
    classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    cm = metrics.confusion_matrix(y_test, y_pred)
    print('KNeighbors Classifier metrics:')
    print('R-squared:', metrics.r2_score(y_test, y_pred))
    print('Confusion matrix:\n', cm)
    print('Model accuracy: ', metrics.accuracy_score(y_test, y_pred))
    return y_pred

In [69]:
result = KNghb(x_train,y_train,x_test,y_test)
result

KNeighbors Classifier metrics:
R-squared: 0.5833333333333334
Confusion matrix:
 [[ 4  2]
 [ 0 24]]
Model accuracy:  0.9333333333333333


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0], dtype=int64)

In [70]:
# Function for the Naive Bayes algorithm
def NBA(x_train,y_train,x_test,y_test):
    classifier = GaussianNB()
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    cm = metrics.confusion_matrix(y_test, y_pred)
    print('Naive Bayes Algorithm metrics:')
    print('R-squared:', metrics.r2_score(y_test, y_pred))
    print('Confusion matrix:\n', cm)
    print('Model accuracy: ', metrics.accuracy_score(y_test, y_pred))
    return y_pred

In [71]:
result = NBA(x_train,y_train,x_test,y_test)
result

Naive Bayes Algorithm metrics:
R-squared: 0.16666666666666674
Confusion matrix:
 [[ 5  1]
 [ 3 21]]
Model accuracy:  0.8666666666666667


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0], dtype=int64)

The first three models showed the same accuracy. We settle on logistic regression and take
its result for the comparison with the test set.

## Part 3. Comparing actual and predicted outcomes

In [76]:
# Save the indices of the test dataframe so that we can merge on them later.
# After scaling, x_test is a NumPy array without an index, so we take the index
# from full_test, which holds the same rows (same random_state and test_size)
saved_indecies = pd.DataFrame(full_test.index)
# Combine the saved indices with the predictions
# (result must hold the logistic regression predictions, so run that cell last)
saved_indecies.insert(1,'Predicted', result)
saved_indecies.reset_index(inplace=True)
saved_indecies.set_index([0], inplace=True)
saved_indecies = saved_indecies.drop(['index'], axis=1)
# Insert the predictions into the test dataframe
df=pd.merge(full_test, saved_indecies, how='left', left_index=True, right_index=True)

In [77]:
# Decode Mission Status and Predicted back to text labels
df.loc[df['Mission Status'] == 0, 'Mission Status'] = 'Failure'
df.loc[df['Mission Status'] == 1, 'Mission Status'] = 'Success'
df.loc[df['Predicted'] == 0, 'Predicted'] = 'Failure'
df.loc[df['Predicted'] == 1, 'Predicted'] = 'Success'

In [78]:
# We have 2 errors, at indices 7 and 8.
df[['Mission Status', 'Predicted']]

Unnamed: 0,Mission Status,Predicted
114,Success,Success
62,Success,Success
33,Success,Success
107,Success,Success
7,Failure,Success
100,Success,Success
40,Success,Success
86,Success,Success
76,Success,Success
71,Success,Success
