<a href="https://colab.research.google.com/github/szertan/DATASCI112-Space-Thruster-Project/blob/main/STP_Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection and Cleaning 


### 1.1 Downloading the training and test sets through the Kaggle API.

The dataset analyzed in this project is downloaded through the Kaggle API.

In [None]:
# Pull the files from Kaggle into Colab
! pip install -q kaggle
from google.colab import files
import numpy as np
files.upload()  # upload the kaggle.json API token downloaded from a Kaggle profile

In [None]:
! mkdir -p ~/.kaggle  # -p avoids an error if the directory already exists
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list

In [None]:
! kaggle datasets download -d patrickfleith/spacecraft-thruster-firing-tests-dataset

In [20]:
# -o overwrites existing files without prompting; -q suppresses the file listing
! unzip -o -q /content/spacecraft-thruster-firing-tests-dataset.zip

### 1.2 Pulling the Metadata into a Pandas DataFrame



The dataset documentation states that entries 1269 to 2268 are reserved for testing, whereas entries 0001 to 1001 are reserved for training. We also know that serial numbers 1 to 12 (column 'sn') are for training, and the remaining serial numbers may be used for testing. Hence, we will separate our data into two subsets based on the serial number.

The firing file for uid 579 (row index 578) also appears to be missing, so we drop that row.

In [16]:
import pandas as pd
metadata = pd.read_csv("/content/metadata.csv")
metadata = metadata.drop(578)  # row index 578 corresponds to uid 579
metadata

Unnamed: 0,uid,filename,test_id,sn,test_pressure,test_mode,vl1,vl2,vl3,anomalous,anomaly_code,cumulated_throughput,cumulated_on_time,cumulated_pulses
0,1,00001_001_SN01_24bars_ssf.csv,1,1,24.0,ssf,True,True,False,False,0.0,0.000000,0.000000,0.0
1,2,00002_002_SN01_21bars_ssf.csv,2,1,21.0,ssf,True,True,False,False,0.0,0.451717,0.083333,1.0
2,3,00003_003_SN01_18bars_ssf.csv,3,1,18.0,ssf,True,True,False,False,0.0,0.842268,0.166667,2.0
3,4,00004_004_SN01_15bars_ssf.csv,4,1,15.0,ssf,True,True,False,False,0.0,1.174725,0.250000,3.0
4,5,00005_005_SN01_12bars_ssf.csv,5,1,12.0,ssf,True,True,False,False,0.0,1.452586,0.333333,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2607,2608,02608_108_SN24_5bars_onmod.csv,108,24,5.0,onmod,True,False,False,False,0.0,20.603296,6.743203,20860.0
2608,2609,02609_109_SN24_5bars_offmod.csv,109,24,5.0,offmod,True,True,False,False,0.0,20.629288,6.765689,21660.0
2609,2610,02610_110_SN24_5bars_random_short.csv,110,24,5.0,random_short,True,True,False,False,0.0,20.868589,6.976800,22460.0
2610,2611,02611_111_SN24_5bars_random_long.csv,111,24,5.0,random_long,True,True,False,False,0.0,20.912820,7.016875,22760.0
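The missing file noted above could also be detected programmatically rather than by inspection. The following is a minimal sketch, assuming the firing files sit in separate train and test folders selected by serial number; the helper `find_missing_files`, its parameters, and the toy file names are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def find_missing_files(metadata, train_dir, test_dir, max_train_sn=12):
    """Return the metadata rows whose firing file is not present on disk."""
    missing = []
    for index, row in metadata.iterrows():
        # Serial numbers up to max_train_sn are assumed to live in train_dir
        folder = train_dir if row['sn'] <= max_train_sn else test_dir
        if not os.path.exists(os.path.join(folder, row['filename'])):
            missing.append(index)
    return metadata.loc[missing]


# Toy demonstration with placeholder file names (not the real dataset)
with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, 'train')
    test_dir = os.path.join(root, 'test')
    os.makedirs(train_dir)
    os.makedirs(test_dir)
    toy = pd.DataFrame({
        'filename': ['a.csv', 'b.csv', 'c.csv'],
        'sn': [1, 1, 13],
    })
    # Create only two of the three files so that 'b.csv' is missing
    open(os.path.join(train_dir, 'a.csv'), 'w').close()
    open(os.path.join(test_dir, 'c.csv'), 'w').close()
    missing_rows = find_missing_files(toy, train_dir, test_dir)
    print(missing_rows)
```

On the real data, the returned index could be passed to `metadata.drop` instead of hard-coding a row number.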


### 1.3 Exploring Individual Firing Dataframes 

In [17]:
sample = pd.read_csv('/content/dataset/dataset/train/00001_001_SN01_24bars_ssf.csv')
sample

Unnamed: 0,time,ton,thrust,mfr,vl,anomaly_code
0,2021-01-18 08:00:00.000,0,-0.003521,0.000000e+00,0.0,
1,2021-01-18 08:00:00.010,0,-0.000250,0.000000e+00,0.0,
2,2021-01-18 08:00:00.020,0,-0.002765,0.000000e+00,0.0,
3,2021-01-18 08:00:00.030,0,0.000846,0.000000e+00,0.0,
4,2021-01-18 08:00:00.040,0,0.003115,0.000000e+00,0.0,
...,...,...,...,...,...,...
30595,2021-01-18 08:05:05.950,0,0.000283,0.000000e+00,1.0,
30596,2021-01-18 08:05:05.960,0,-0.001638,0.000000e+00,1.0,
30597,2021-01-18 08:05:05.970,0,0.000856,0.000000e+00,1.0,
30598,2021-01-18 08:05:05.980,0,-0.000922,1.939685e-16,1.0,


A bit of groundwork shows that each firing file contains roughly 3,000 to 30,000 data points, so we need a way to summarize each one. We will use only two columns, thrust and mfr: for each firing we calculate the root-mean-square (RMS) value of the thrust column and the average value of the mfr column. We then save these two values as new columns in the metadata dataframe, one pair per row.
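As an illustration of this reduction, here is a small sketch on made-up values. The function name `summarize_firing` and the toy numbers are assumptions for demonstration only; the column names `thrust` and `mfr` are the ones shown in the sample firing above.

```python
import numpy as np
import pandas as pd


def summarize_firing(firing):
    """Reduce one firing time series to (thrust RMS, mean of mfr)."""
    thrust_rms = np.sqrt((firing['thrust'] ** 2).mean())
    mfr_avg = firing['mfr'].mean()
    return thrust_rms, mfr_avg


# Toy values, not taken from the dataset
toy_firing = pd.DataFrame({
    'thrust': [3.0, -4.0, 3.0, -4.0],
    'mfr': [0.0, 1.0, 2.0, 3.0],
})
thrust_rms, mfr_avg = summarize_firing(toy_firing)
print(thrust_rms, mfr_avg)
```

Because the RMS squares each sample before averaging, negative and positive thrust readings of the same magnitude contribute equally, unlike a plain mean, where they would cancel.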

In [18]:
metadata.loc[575:580, :]

Unnamed: 0,uid,filename,test_id,sn,test_pressure,test_mode,vl1,vl2,vl3,anomalous,anomaly_code,cumulated_throughput,cumulated_on_time,cumulated_pulses
575,576,00576_016_SN06_24bars_ramp4.csv,16,6,24.0,ramp4,True,True,False,True,28.0,2.736946,0.713361,1343.0
576,577,00577_017_SN06_24bars_onmod.csv,17,6,24.0,onmod,True,True,False,True,28.0,3.334709,0.826417,1416.0
577,578,00578_018_SN06_18.0bars_offmod.csv,18,6,18.0,offmod,True,True,False,True,14.0,4.240511,1.059333,3016.0
579,580,00580_019_SN06_24bars_random_short.csv,19,6,24.0,random_short,True,True,False,False,0.0,5.034695,1.270444,3816.0
580,581,00581_020_SN06_24bars_random_long.csv,20,6,24.0,random_long,True,True,False,False,0.0,5.232672,1.310625,4116.0


In [19]:
for index in metadata.index:
  if index % 100 == 0:
    print(index)  # progress indicator
  # Serial numbers 1 to 12 are read from train/, all others from test/
  path_prefix = '/content/dataset/dataset/train/' if metadata.loc[index, 'sn'] <= 12 else '/content/dataset/dataset/test/'
  read = pd.read_csv(path_prefix + metadata.loc[index, 'filename'])
  # Summarize the firing: RMS of thrust and mean of mfr
  metadata.loc[index, 'thrust_rms'] = np.sqrt((read['thrust'] ** 2).mean())
  metadata.loc[index, 'mfr_avg'] = read['mfr'].mean()

0


ParserError: ignored
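The `ParserError` above stops the loop at the first file that pandas cannot parse, and its cause is not identified here. One possible way to keep the loop going is to catch the error per file and record missing values instead. The sketch below assumes that skipping unreadable files is acceptable; the helper `summarize_file` and the toy CSV contents are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def summarize_file(path):
    """Return (thrust_rms, mfr_avg) for one file, or (nan, nan) if it cannot be parsed."""
    try:
        firing = pd.read_csv(path)
    except (pd.errors.ParserError, pd.errors.EmptyDataError) as error:
        print('Skipping', path, '->', error)
        return np.nan, np.nan
    return np.sqrt((firing['thrust'] ** 2).mean()), firing['mfr'].mean()


# Toy demonstration: one well-formed file and one malformed file
with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, 'good.csv')
    bad_path = os.path.join(folder, 'bad.csv')
    with open(good_path, 'w') as handle:
        handle.write('thrust,mfr\n1.0,2.0\n-1.0,4.0\n')
    with open(bad_path, 'w') as handle:
        # The third line has more fields than the header, which the parser rejects
        handle.write('thrust,mfr\n1.0,2.0\n3.0,4.0,5.0,6.0\n')
    good_result = summarize_file(good_path)
    bad_result = summarize_file(bad_path)
print(good_result, bad_result)
```

Rows left as NaN can then be inspected individually to decide whether the underlying file is corrupted or simply formatted differently.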

In [None]:
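from google.colab import drive
drive.mount('/content/drive')  # Google Drive must be mounted before writing to MyDrive
# Training subset: serial numbers 1 to 12
metadata_train = metadata.loc[metadata['sn'] <= 12, :]
df_metadata_train = pd.DataFrame(metadata_train)
df_metadata_train.to_csv('/content/drive/MyDrive/STP/metadata_train.csv', index=False)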

In [None]:
# Test subset: all remaining serial numbers
metadata_test = metadata.loc[metadata['sn'] > 12, :]
df_metadata_test = pd.DataFrame(metadata_test)
df_metadata_test.to_csv('/content/drive/MyDrive/STP/metadata_test.csv', index=False)

We have now separated our data into training and testing subsets and saved them as separate CSV files. We will use them as our primary datasets for the rest of this project.
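A quick sanity check on a split like this is to confirm that every row lands in exactly one subset. The sketch below uses a toy frame with made-up `uid` and `sn` values and the same threshold of 12; it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Toy metadata, not the real dataset
toy_metadata = pd.DataFrame({
    'uid': [1, 2, 3, 4],
    'sn': [1, 12, 13, 24],
})
train = toy_metadata.loc[toy_metadata['sn'] <= 12]
test = toy_metadata.loc[toy_metadata['sn'] > 12]

# The two subsets together cover every row and share none
assert len(train) + len(test) == len(toy_metadata)
assert train.index.intersection(test.index).empty
print(len(train), len(test))
```

A row with a missing `sn` value would satisfy neither comparison, so the first assertion would also flag that case.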