# Homework 1 Analytics Base Table Construction
---
In this homework assignment, you will begin to explore the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM). 


Below you will find a number of steps that you will be required to complete before you can start the assignment.

---

## Step 1: Downloading the Data
---

This assignment uses only [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB), but we will use more than one partition by the end of the semester. In later steps, you will need to access the uncompressed files from these partitions, so remember where you put them.

A paper describing the construction of the dataset can be found [here](https://doi.org/10.1038/s41597-020-0548-x).

---

Individual partitions of the dataset can be accessed through the following links:
- [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB)
- [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD)
- [Partition 3](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/PTPGQT)
- [Partition 4](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/FIFLFU)
- [Partition 5](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/QC2C3X)

---

### Dataset Attributes:

Each file in the dataset contains the following attributes as a single variate of the multivariate timeseries sample. 

|              |                  |             |
|--------------|------------------|-------------|
| 1. Timestamp | 2. TOTUSJH       | 3. TOTBSQ   |	
| 4. TOTPOT	   | 5. TOTUSJZ       | 6. ABSNJZH  |	
| 7. SAVNCPP   | 8. USFLUX        | 9. TOTFZ	|
| 10. MEANPOT  | 11. EPSZ	      | 12. MEANSHR |
| 13. SHRGT45  | 14. MEANGAM      | 15. MEANGBT |
| 16. MEANGBZ  | 17. MEANGBH      | 18. MEANJZH |
| 19. TOTFY    | 20. MEANJZD      | 21. MEANALP |	
| 22. TOTFX    | 23. EPSY	      | 24. EPSX	|
| 25. R_VALUE  | 26. CRVAL1       | 27. CRLN_OBS|	
| 28. CRLT_OBS | 29. CRVAL2       | 30. HC_ANGLE|	
| 31. SPEI     | 32. LAT_MIN      | 33. LON_MIN |
| 34. LAT_MAX  | 35. LON_MAX      | 36. QUALITY |	
| 37. BFLARE   | 38. BFLARE_LABEL | 39. CFLARE  |
| 40. CFLARE_LABEL | 41. MFLARE | 42. MFLARE_LABEL |
| 43. XFLARE | 44. XFLARE_LABEL | 45. BFLARE_LOC |
| 46. BFLARE_LABEL_LOC | 47. CFLARE_LOC | 48. CFLARE_LABEL_LOC |
| 49. MFLARE_LOC | 50. MFLARE_LABEL_LOC | 51. XFLARE_LOC |
| 52. XFLARE_LABEL_LOC | 53. XR_MAX | 54. XR_QUAL |
| 55. IS_TMFI | | |

---


## Step 2: Unpacking the data
---

The partitions come in tar.gz archive files. These are easily opened on all current operating systems using the same command in the terminal.

- On Windows 10: Use cmd.exe, then run: tar xf partition1_instances.tar.gz
- On Linux: In the terminal run: tar xf partition1_instances.tar.gz
- On Mac: In the terminal run: tar xf partition1_instances.tar.gz

These all assume you are in the directory that contains the tar.gz file and that you wish to unpack in this same directory. Search for tar commands if you wish to do something else.
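
If you would rather unpack from inside the notebook, the same extraction can be sketched with Python's standard `tarfile` module. The helper name `unpack_archive` and the paths in the commented call are assumptions made for this illustration and are not part of the assignment.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path: str, target_dir: str) -> list:
    """Extract a .tar.gz archive into target_dir and return the member names."""
    # Create the target directory if it does not exist yet.
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names


# Hypothetical call, assuming the archive sits in the current directory:
# unpack_archive("partition1_instances.tar.gz", ".")
```

Only extract archives you trust, since `extractall` writes whatever paths the archive contains.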

---

## About the data
---

The __partition1__ directory contains two subdirectories, __FL__ and __NF__. These subdirectories represent the two classes of our target feature in the solar flare prediction problem we will be attempting to solve this semester.

- __FL__: Represents the multivariate time series samples that have a Solar Flare occur within 24 hours of the observation.
- __NF__: Represents the multivariate time series samples that do not have a significant Solar Flare occur within 24 hours of the observation (either no flare at all, or only a smaller flare, as noted below).

The multivariate time series samples are stored in .csv files for each individual sample. Each file name contains a number of pieces of information that we will wish to keep for our prediction task and therefore should be part of your Analytics Base Table. Below are examples of the naming for each sample type.

- __FL__ file name example: M1.0@265:Primary_ar115_s2010-08-06T06:36:00_e2010-08-06T18:24:00.csv
- __NF__ file name example: FQ_ar99_s2010-08-01T19:00:00_e2010-08-02T06:48:00.csv or B1.9@909:Primary_ar325_s2011-01-04T02:36:00_e2011-01-04T14:24:00.csv

Let's look at these formats, starting with those that contain an @ symbol (we will use the __FL__ file as an example but note that the __NF__ data also has files with this naming):
- __M1.0@265:Primary__: This says that there occurs an M1.0 sized flare within 24 hours of our sample. It also says that this flare is numbered 265 in the accompanying integrated flare dataset that comes as a supplementary file to this dataset. Additionally, "Primary" indicates that the intersection with this active region was verified through the primary method described in the paper.  
- __\_ar115__: This indicates which active region the sample comes from in the original unsampled dataset.
- __\_s2010-08-06T06:36:00__: This is the start time of the sample.
- __\_e2010-08-06T18:24:00.csv__: This is the end time of the sample.

The files that don't contain the @ symbol begin with FQ and do not have any flare occurring within 24 hours of the sample in the file. __Note__ that both the __FL__ and __NF__ directories contain files that have flares within 24 hours, but the __NF__ ones are smaller flares that we are considering as unimportant and therefore fall in the non-flaring class.
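
The naming scheme described above can be pulled apart with a regular expression. The following is only a sketch: the names `NAME_PATTERN`, `parse_time`, and `parse_sample_name` are hypothetical, and the pattern assumes every file name follows one of the two example formats. It also accepts `_` in place of `:` inside the time stamps, since the test cell later in this notebook uses file names written that way.

```python
import re
from datetime import datetime

# Hypothetical pattern and helper names, chosen for this illustration only.
NAME_PATTERN = re.compile(
    r"^(?P<label>.+?)_ar(?P<region>\d+)_s(?P<start>.+?)_e(?P<end>.+?)\.csv$"
)


def parse_time(text: str) -> datetime:
    # Accept both "06:36:00" and "06_36_00" style time stamps.
    return datetime.strptime(text.replace("_", ":"), "%Y-%m-%dT%H:%M:%S")


def parse_sample_name(file_name: str) -> dict:
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unrecognized sample file name: {file_name}")
    label = match["label"]  # "FQ" or something like "M1.0@265:Primary"
    return {
        "flare_type": "FQ" if label.startswith("FQ") else label[0],
        "active_region": int(match["region"]),
        "start_time": parse_time(match["start"]),
        "end_time": parse_time(match["end"]),
    }


info = parse_sample_name(
    "M1.0@265:Primary_ar115_s2010-08-06T06:36:00_e2010-08-06T18:24:00.csv"
)
print(info["flare_type"], info["active_region"])  # M 115
```

Keeping the active region number is optional for this homework, but it shows that every piece of the name is recoverable from one pattern.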

---


## Question 1: Reading the flare data (25 points)
---

Now that you have an understanding of the data, you will develop a method to read the flaring data and return an object that contains the data from the csv file and some of the information contained in the file name.

Below is the object you will return.

In [None]:
import os
import re
import pandas
import statistics
from pandas import DataFrame
from datetime import datetime

In [None]:
class MVTSSample:
    
    def __init__(self, flare_type:str, start_time:datetime, end_time:datetime, data:DataFrame):
        self._flare_type = flare_type
        self._start_time = start_time
        self._end_time = end_time
        self._data = data
    
    def get_flare_type(self):
        return self._flare_type
    
    def get_start_time(self):
        return self._start_time
    
    def get_end_time(self):
        return self._end_time
    
    def get_data(self):
        return self._data

### About the MVTSSample class
---

The above class represents the data contained in one file. You are to return one of these objects for each call to your method(s). 

- The __flare_type__ is to be one of the following selections (__X__, __M__, __C__, __B__, __A__, __FQ__), and these labels will be derived from the information in the file name.
- __start_time__ is the start time in the file name
- __end_time__ is the end time in the file name
- __data__ is a Pandas DataFrame which you will load from the csv using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.  
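
As a small illustration of the loading step, the sketch below reads tab-separated text into a DataFrame with `pandas.read_csv`. It assumes the sample files are tab-separated, which is what the solution code later in this notebook also assumes. The two data rows and their time stamp format are made up for the example and only stand in for a real file.

```python
import io
import pandas

# Made-up contents standing in for one sample file (assumed tab-separated).
sample_text = (
    "Timestamp\tTOTUSJH\tTOTBSQ\n"
    "2010-08-06 06:36:00\t10.5\t200.0\n"
    "2010-08-06 06:48:00\t11.5\t210.0\n"
)

# read_csv accepts a path or any file-like object; sep="\t" selects tab as the delimiter.
data = pandas.read_csv(io.StringIO(sample_text), sep="\t")
print(data.shape)  # (2, 3)
print(list(data.columns))
```

If the delimiter were wrong, pandas would load each line as a single column, so checking `data.shape` after loading is a quick sanity test.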

---

### About your method
---

Your method is to take in the path and name of the file to open, and it is to return one MVTSSample for that file.

Below is a definition for that method. Use it and write the code to complete the tasks necessary to return the specified information. You can use a method call in another code block to test that your method works as required.

In [None]:
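def read_flare_mvts(data_dir:str, file_name:str) -> MVTSSample:
    # Times appear as HH:MM:SS, or as HH_MM_SS when the file name was written without ':'.
    start, end = re.search(r'_s(.+?)_e(.+?)\.csv$', file_name).groups()
    to_time = lambda t: datetime.strptime(t.replace('_', ':'), '%Y-%m-%dT%H:%M:%S')
    data = pandas.read_csv(os.path.join(data_dir, file_name), sep='\t')
    return MVTSSample(file_name[0], to_time(start), to_time(end), data)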

In [None]:
data_dir = "C:/Users/User/Desktop/python class/data/partition1/FL"
file_name = "M1.0@265_Primary_ar115_s2010-08-06T13_36_00_e2010-08-07T01_24_00.csv"
results = read_flare_mvts(data_dir, file_name)

## Question 2: Reading the non-flare data (25 points)
---

Same as Question 1, but now you will do it for non-flaring data.

Below is a definition for that method. Use it and write the code to complete the tasks necessary to return the specified information. You can use a method call in another code block to test that your method works as required.

In [None]:
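def read_non_flare_mvts(data_dir:str, file_name:str) -> MVTSSample:
    flare_type = 'FQ' if file_name.startswith('FQ') else file_name[0]  # flare-quiet files start with "FQ"
    start, end = re.search(r'_s(.+?)_e(.+?)\.csv$', file_name).groups()
    to_time = lambda t: datetime.strptime(t.replace('_', ':'), '%Y-%m-%dT%H:%M:%S')
    data = pandas.read_csv(os.path.join(data_dir, file_name), sep='\t')
    return MVTSSample(flare_type, to_time(start), to_time(end), data)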

In [None]:
data_dir = "C:/Users/User/Desktop/python class/data/partition1/NF"
file_name_1 = "FQ_ar99_s2010-08-01T19:00:00_e2010-08-02T06:48:00.csv"
file_name_2 = "B1.9@909:Primary_ar325_s2011-01-04T02:36:00_e2011-01-04T14:24:00.csv"
results1 = read_non_flare_mvts(data_dir, file_name_1)
results2 = read_non_flare_mvts(data_dir, file_name_2)

## Question 3: Processing the DataFrame (25 points)
---

Now that you can read individual files to get the multivariate time series for a sample period, it is time to start building the analytics base table.

The machine learning methods that we will cover this semester are generally applied to tabular data with a set of descriptive features that are used to learn to classify or predict a target feature. To accomplish this with our raw input multivariate time series, we must produce a set of descriptive features from each of the variates of the time series.

In this question you will process the DataFrame that was returned from your previous two methods to construct a set of descriptive features for each sample.

---

### DataFrame Attributes:

Each file in the dataset contains the following attributes as a variate of the multivariate timeseries sample. 

|              |                  |             |
|--------------|------------------|-------------|
| 1. Timestamp | 2. TOTUSJH       | 3. TOTBSQ   |	
| 4. TOTPOT	   | 5. TOTUSJZ       | 6. ABSNJZH  |	
| 7. SAVNCPP   | 8. USFLUX        | 9. TOTFZ	|
| 10. MEANPOT  | 11. EPSZ	      | 12. MEANSHR |
| 13. SHRGT45  | 14. MEANGAM      | 15. MEANGBT |
| 16. MEANGBZ  | 17. MEANGBH      | 18. MEANJZH |
| 19. TOTFY    | 20. MEANJZD      | 21. MEANALP |	
| 22. TOTFX    | 23. EPSY	      | 24. EPSX	|
| 25. R_VALUE  | 26. CRVAL1       | 27. CRLN_OBS|	
| 28. CRLT_OBS | 29. CRVAL2       | 30. HC_ANGLE|	
| 31. SPEI     | 32. LAT_MIN      | 33. LON_MIN |
| 34. LAT_MAX  | 35. LON_MAX      | 36. QUALITY |	
| 37. BFLARE   | 38. BFLARE_LABEL | 39. CFLARE  |
| 40. CFLARE_LABEL | 41. MFLARE | 42. MFLARE_LABEL |
| 43. XFLARE | 44. XFLARE_LABEL | 45. BFLARE_LOC |
| 46. BFLARE_LABEL_LOC | 47. CFLARE_LOC | 48. CFLARE_LABEL_LOC |
| 49. MFLARE_LOC | 50. MFLARE_LABEL_LOC | 51. XFLARE_LOC |
| 52. XFLARE_LABEL_LOC | 53. XR_MAX | 54. XR_QUAL |
| 55. IS_TMFI | | |


These columns should be present in the DataFrame returned by your previous two methods. We will only be utilizing a fraction of these. The method description below gives you more information about which ones we will use.

---

### About your method
---
The following are the variates from which we will calculate descriptive features.

|              |                  |             |
|--------------|------------------|-------------|
| 1. R_VALUE   | 2. TOTUSJH       | 3. TOTBSQ   |	
| 4. TOTPOT	   | 5. TOTUSJZ       | 6. ABSNJZH  |	
| 7. SAVNCPP   | 8. USFLUX        | 9. TOTFZ	|
| 10. MEANPOT  | 11. EPSZ	      | 12. MEANSHR |
| 13. SHRGT45  | 14. MEANGAM      | 15. MEANGBT |
| 16. MEANGBZ  | 17. MEANGBH      | 18. MEANJZH |
| 19. TOTFY    | 20. MEANJZD      | 21. MEANALP |	
| 22. TOTFX    |        	      |         	|

For each of these variates you will calculate two descriptive features: 
- Mean 
- Standard Deviation

We will add more later, but for now, this will be sufficient to demonstrate the analytics base table construction process.
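
To make the two features concrete, here is a minimal sketch on a made-up DataFrame with two of the variates. The values are invented, and the `_MEAN` and `_STDDEV` suffixes follow the naming used in the method definition for this question. Note that `DataFrame.std` computes the sample standard deviation (dividing by n - 1) by default, which matches `statistics.stdev`.

```python
import pandas

# Made-up values for two variates; a real sample has many more rows and columns.
toy = pandas.DataFrame({
    "TOTUSJH": [1.0, 2.0, 3.0, 4.0],
    "TOTBSQ": [10.0, 10.0, 10.0, 10.0],
})

means = toy.mean()
stds = toy.std()  # sample standard deviation (ddof=1)

# Build a single-row table with one MEAN and one STDDEV column per variate.
features = {}
for name in toy.columns:
    features[f"{name}_MEAN"] = [means[name]]
    features[f"{name}_STDDEV"] = [stds[name]]
row = pandas.DataFrame(features)

print(row["TOTUSJH_MEAN"][0])  # 2.5
print(row.shape)  # (1, 4)
```

Each sample therefore collapses to one row, no matter how many time steps the original file contained.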

Below is a method definition. Complete it to return the information specified above. You can use a method call in another code block to test that your method works as required.


In [None]:
def calculate_descriptive_features(data:DataFrame)-> DataFrame:
    variates_to_calc_on = [ 'R_VALUE','TOTUSJH','TOTBSQ','TOTPOT','TOTUSJZ','ABSNJZH','SAVNCPP',
                           'USFLUX','TOTFZ','MEANPOT','EPSZ','MEANSHR','SHRGT45','MEANGAM','MEANGBT',
                           'MEANGBZ','MEANGBH','MEANJZH','TOTFY','MEANJZD','MEANALP','TOTFX']
    features_to_return = [ 'R_VALUE_MEAN','R_VALUE_STDDEV',
                          'TOTUSJH_MEAN','TOTUSJH_STDDEV',
                          'TOTBSQ_MEAN','TOTBSQ_STDDEV',
                          'TOTPOT_MEAN','TOTPOT_STDDEV',
                          'TOTUSJZ_MEAN','TOTUSJZ_STDDEV',
                          'ABSNJZH_MEAN','ABSNJZH_STDDEV',
                          'SAVNCPP_MEAN','SAVNCPP_STDDEV',
                          'USFLUX_MEAN','USFLUX_STDDEV',
                          'TOTFZ_MEAN','TOTFZ_STDDEV',
                          'MEANPOT_MEAN','MEANPOT_STDDEV',
                          'EPSZ_MEAN','EPSZ_STDDEV',
                          'MEANSHR_MEAN','MEANSHR_STDDEV',
                          'SHRGT45_MEAN','SHRGT45_STDDEV',
                          'MEANGAM_MEAN','MEANGAM_STDDEV',
                          'MEANGBT_MEAN','MEANGBT_STDDEV',
                          'MEANGBZ_MEAN','MEANGBZ_STDDEV',
                          'MEANGBH_MEAN','MEANGBH_STDDEV',
                          'MEANJZH_MEAN','MEANJZH_STDDEV',
                          'TOTFY_MEAN','TOTFY_STDDEV',
                          'MEANJZD_MEAN','MEANJZD_STDDEV',
                          'MEANALP_MEAN','MEANALP_STDDEV',
                          'TOTFX_MEAN','TOTFX_STDDEV']
                          
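    # Build one row: the mean and sample standard deviation of each variate,
    # in the same order as features_to_return (MEAN first, then STDDEV).
    feature_values = []
    for variate in variates_to_calc_on:
        values = data[variate].tolist()
        feature_values.append(statistics.mean(values))
        feature_values.append(statistics.stdev(values))

    return pandas.DataFrame(data={k:[v] for k,v in zip(features_to_return, feature_values)})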

In [None]:
calculate_descriptive_features(results.get_data())

---

## Question 4: Putting it all together (25 points)
---

Now that you have the tools to read the data and process descriptive features, it is time to put this all together to produce an analytics base table for all of the data in Partition 1.

In this question, you shall construct a method that will process a partition by extracting features for each sample in both the __FL__ and __NF__ subdirectories of that partition. The extracted descriptive features are to be placed into your analytics base table DataFrame as columns, with the addition of the __FLARE_TYPE__ target feature.

Your method should take in the partition location and assume that there will be __FL__ and __NF__ subdirectories to process.

Your method shall also take in the name of the analytics base table to store. This should be the full file name, including an absolute or relative path to the location where the table is to be stored.
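
The final assembly step can be sketched as follows: stack the single-row feature tables with `pandas.concat` and write the result to the path that was passed in. The two rows, their values, and the temporary file name are made up for this illustration, and the tab separator is an assumption that mirrors the input files rather than a requirement of the assignment.

```python
import os
import tempfile
import pandas

# Made-up single-row feature tables, one per sample.
rows = [
    pandas.DataFrame({"FLARE_TYPE": ["M"], "R_VALUE_MEAN": [4.2]}),
    pandas.DataFrame({"FLARE_TYPE": ["FQ"], "R_VALUE_MEAN": [0.0]}),
]

# Stack the rows into one table and renumber the index from 0.
abt = pandas.concat(rows, ignore_index=True)

# Write to a throwaway location; in the homework this would be the path passed to your method.
abt_path = os.path.join(tempfile.mkdtemp(), "toy_abt.csv")
abt.to_csv(abt_path, sep="\t", index=False)

# Reading the file back confirms that the table round-trips.
restored = pandas.read_csv(abt_path, sep="\t")
print(restored.shape)  # (2, 2)
```

Collecting the rows in a list and concatenating once at the end is generally faster than appending to a DataFrame inside the loop.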

Below you will find a method definition. Complete it to perform the tasks specified above. You can use a method call in another code block to test that your method works as required.


In [None]:
def process_partition(partition_location:str, abt_name:str):
    abt_header = [ 'FLARE_TYPE', 'R_VALUE_MEAN','R_VALUE_STDDEV',
                          'TOTUSJH_MEAN','TOTUSJH_STDDEV',
                          'TOTBSQ_MEAN','TOTBSQ_STDDEV',
                          'TOTPOT_MEAN','TOTPOT_STDDEV',
                          'TOTUSJZ_MEAN','TOTUSJZ_STDDEV',
                          'ABSNJZH_MEAN','ABSNJZH_STDDEV',
                          'SAVNCPP_MEAN','SAVNCPP_STDDEV',
                          'USFLUX_MEAN','USFLUX_STDDEV',
                          'TOTFZ_MEAN','TOTFZ_STDDEV',
                          'MEANPOT_MEAN','MEANPOT_STDDEV',
                          'EPSZ_MEAN','EPSZ_STDDEV',
                          'MEANSHR_MEAN','MEANSHR_STDDEV',
                          'SHRGT45_MEAN','SHRGT45_STDDEV',
                          'MEANGAM_MEAN','MEANGAM_STDDEV',
                          'MEANGBT_MEAN','MEANGBT_STDDEV',
                          'MEANGBZ_MEAN','MEANGBZ_STDDEV',
                          'MEANGBH_MEAN','MEANGBH_STDDEV',
                          'MEANJZH_MEAN','MEANJZH_STDDEV',
                          'TOTFY_MEAN','TOTFY_STDDEV',
                          'MEANJZD_MEAN','MEANJZD_STDDEV',
                          'MEANALP_MEAN','MEANALP_STDDEV',
                          'TOTFX_MEAN','TOTFX_STDDEV']

    # Use the FL and NF subdirectories of the partition that was passed in.
    readers = {'FL': read_flare_mvts, 'NF': read_non_flare_mvts}
    analyzed = []
    for sub_dir, reader in readers.items():
        data_dir = os.path.join(partition_location, sub_dir)
        for file_name in os.listdir(data_dir):
            try:
                results = reader(data_dir, file_name)
                data = calculate_descriptive_features(results.get_data())
                data.insert(0, 'FLARE_TYPE', results.get_flare_type())
                analyzed.append(data)
            except Exception as err:
                # Skip files that cannot be read or parsed, but report why.
                print(f'{file_name} was skipped: {err}')

    # abt_name is the full (absolute or relative) path of the table to write.
    df = pandas.concat(analyzed, ignore_index=True)[abt_header]
    df.to_csv(abt_name, sep="\t", index=False, header=True)

In [None]:
data_dir = "C:/Users/User/Desktop/python class/data/partition1"
process_partition(data_dir, os.path.join(data_dir, "Flare-Analyzed-Data.csv"))