# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>
- TAs: Tong Zeng <tozeng@syr.edu>, Priya Matnani <psmatnan@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` statements) are used to grade your answers. However, the professor and TAs will use __additional__ test for your answer. Think about cases where your code should run even if it passess all the tests you see.
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to 
`Kernel`$\rightarrow$`Restart and Run All`). 
- Good luck!

In [1]:
# create spark and sparkcontext objects
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Warning: Use exclusively Spark 1.6 when asked to do so and Spark 2.0 (dataframes) only in the last question. Do not use Pandas at all in this assignment

# Part 1: Data cleaning and basic analyses

In this part, you will learn to read data from non-standard formats, clean data, and produce some basic analysis of it.

We will use Spark 1.6 (`sparkContext` on variable `sc`) to load text files from which we will extract features that are predictive of a target value. Unfortunately, the data is stored in some non-standard format where each line contains the customer index, the feature index, and the value of the feature for that customer. Similarly, the target files contain in each line the customer index and the target value. We will load these files into two RDDs:

In [2]:
raw_features_rdd = sc.textFile('/datasets/features')
raw_target_rdd = sc.textFile('/datasets/targets')

An issue with the data is that there were problems transmitting the features and targets. If this happened, then a text `ERROR TRANSFERRING` replaced the data. If you look at the first 35 values of the features and target RDD, you will see these types of lines:

```python
In: raw_features_rdd.take(35)
```

```bash
Out: ['customer(4); feature(0); value(-0.79)',
 'customer(4); feature(1); value(-0.28)',
 'customer(4); feature(2); value(-0.26)',
 'customer(4); feature(3); value(0.28)',
 'customer(4); feature(4); value(-0.17)',
 'customer(4); feature(5); value(-0.51)',
 'customer(4); feature(6); value(0.44)',
 'customer(4); feature(7); value(-0.62)',
 'customer(4); feature(8); value(1.82)',
 'customer(4); feature(9); value(-0.35)',
 'customer(14); feature(0); value(0.53)',
 'customer(14); feature(1); value(0.49)',
 'customer(14); feature(2); value(0.29)',
 'customer(14); feature(3); value(0.15)',
 'customer(14); feature(4); value(-0.71)',
 'customer(14); feature(5); value(0.77)',
 'customer(14); feature(6); value(-0.01)',
 'customer(14); feature(7); value(0.86)',
 'customer(14); feature(8); value(0.19)',
 'customer(14); feature(9); value(-0.14)',
 'customer(24); feature(0); value(0.35)',
 'customer(24); feature(1); value(0.63)',
 'customer(24); feature(2); value(0.02)',
 'customer(24); feature(3); value(-0.88)',
 'customer(24); feature(4); value(0.16)',
 'customer(24); feature(5); value(0.78)',
 'customer(24); feature(6); value(-0.34)',
 'customer(24); feature(7); value(-0.06)',
 'customer(24); feature(8); value(1.42)',
 'customer(24); feature(9); value(-0.44)',
 'customer(34); feature(0); value(-0.88)',
 'ERROR TRANSFERRING',
 'customer(34); feature(2); value(-0.69)',
 'customer(34); feature(3); value(0.62)',
 'customer(34); feature(4); value(0.17)']
```

```python
raw_target_rdd.take(35)
```

```bash
['customer(40); target(-157.39)',
 'customer(41); target(-122.48)',
 'customer(42); target(-35.32)',
 'customer(43); target(-117.87)',
 'customer(44); target(112.20)',
 'customer(45); target(-261.49)',
 'customer(46); target(-395.52)',
 'customer(47); target(212.63)',
 'customer(48); target(-62.85)',
 'customer(49); target(-41.15)',
 'customer(140); target(-283.10)',
 'customer(141); target(249.36)',
 'customer(142); target(54.43)',
 'customer(143); target(93.60)',
 'customer(144); target(91.50)',
 'customer(145); target(-140.93)',
 'ERROR TRANSFERRING',
 'customer(147); target(95.28)',
 'customer(148); target(-323.78)',
 'customer(149); target(-293.53)',
 'customer(240); target(-151.64)',
 'customer(241); target(-110.88)',
 'customer(242); target(-109.65)',
 'customer(243); target(99.77)',
 'customer(244); target(-166.56)',
 'customer(245); target(-19.04)',
 'customer(246); target(47.42)',
 'customer(247); target(86.83)',
 'customer(248); target(-161.21)',
 'customer(249); target(110.41)',
 'customer(340); target(-106.21)',
 'customer(341); target(203.06)',
 'customer(342); target(101.65)',
 'customer(343); target(50.91)',
 'ERROR TRANSFERRING']
```

In [3]:
# try it yourself: raw_features_rdd.take(35) and raw_target_rdd.take(35)
raw_features_rdd.take(35)

['customer(7); feature(0); value(-0.67)',
 'customer(7); feature(1); value(0.00)',
 'customer(7); feature(2); value(-0.36)',
 'customer(7); feature(3); value(1.44)',
 'customer(7); feature(4); value(-0.19)',
 'customer(7); feature(5); value(-0.00)',
 'customer(7); feature(6); value(0.06)',
 'customer(7); feature(7); value(-0.20)',
 'customer(7); feature(8); value(0.72)',
 'customer(7); feature(9); value(0.71)',
 'customer(17); feature(0); value(0.48)',
 'customer(17); feature(1); value(-0.34)',
 'customer(17); feature(2); value(0.79)',
 'customer(17); feature(3); value(0.25)',
 'customer(17); feature(4); value(0.20)',
 'customer(17); feature(5); value(0.57)',
 'customer(17); feature(6); value(0.28)',
 'customer(17); feature(7); value(-0.48)',
 'customer(17); feature(8); value(0.93)',
 'customer(17); feature(9); value(-0.14)',
 'customer(27); feature(0); value(0.02)',
 'customer(27); feature(1); value(-0.44)',
 'customer(27); feature(2); value(-0.55)',
 'ERROR TRANSFERRING',
 'customer(

In [4]:
raw_target_rdd.take(35)

['customer(70); target(52.69)',
 'customer(71); target(-215.71)',
 'customer(72); target(125.61)',
 'ERROR TRANSFERRING',
 'customer(74); target(-147.76)',
 'customer(75); target(35.35)',
 'customer(76); target(-104.40)',
 'customer(77); target(-25.67)',
 'customer(78); target(-140.08)',
 'customer(79); target(-50.70)',
 'customer(170); target(71.78)',
 'ERROR TRANSFERRING',
 'customer(172); target(-112.59)',
 'customer(173); target(-482.54)',
 'customer(174); target(-191.87)',
 'customer(175); target(-244.78)',
 'customer(176); target(396.75)',
 'customer(177); target(-19.60)',
 'customer(178); target(89.62)',
 'ERROR TRANSFERRING',
 'customer(270); target(77.86)',
 'customer(271); target(177.87)',
 'ERROR TRANSFERRING',
 'ERROR TRANSFERRING',
 'customer(274); target(80.61)',
 'customer(275); target(-133.14)',
 'customer(276); target(238.34)',
 'customer(277); target(89.10)',
 'customer(278); target(36.88)',
 'customer(279); target(29.02)',
 'customer(370); target(-96.34)',
 'customer

## Question 1:

Filter out the lines that contain `ERROR TRANSFERRING` and store them in `raw_features2_rdd` and `raw_target2_rdd` respectively. 

In [5]:
# create raw_features2_rdd and raw_targets2_rdd below
# YOUR CODE HERE
raw_features2_rdd = raw_features_rdd.filter(lambda x: x != 'ERROR TRANSFERRING')
raw_target2_rdd = raw_target_rdd.filter(lambda x: x != 'ERROR TRANSFERRING')
#raise NotImplementedError()

In [6]:
# check that things work
print(raw_features2_rdd.count())
print(raw_target2_rdd.count())

95036
8968


In [7]:
"""5 pts: Check that the lines are properly discarded"""
assert raw_features2_rdd.count() == 95036
assert raw_target2_rdd.count() == 8968

## Question 2:
You will further process `raw_features2_rdd` such that you will create a key-value RDD of the following form: the key is the customer index as an integer and the value is a dictionary whose key is a string `f_0`, `f_1`, ..., `f_9` for feature index 0, 1, ... 9, respetively, and the value is a floating point number of the feature value. 

Define a function `map_features2` that performs such key-value pair creation.

In [8]:
def map_features2(line):
    parts = line.split(';')
    parts[0] = parts[0].replace('customer(','')
    parts[0] = parts[0].replace(')','').strip()
    parts[1] = parts[1].replace('feature(','f_')
    parts[1] = parts[1].replace(')','').strip()
    parts[2] = parts[2].replace('value(','')
    parts[2] = parts[2].replace(')','').strip()
    maped_rdd = [int(parts[0]),{parts[1]:parts[2]}]
    return maped_rdd
    # YOUR CODE HERE
    raise NotImplementedError()

For example, for the input element:

`'customer(4); feature(0); value(-0.79)'`

it should generate
```python
[4, {'f_0': -0.79}]
```

In [9]:
# test it here
raw_features2_rdd.\
    map(map_features2).\
    take(10)

[[7, {'f_0': '-0.67'}],
 [7, {'f_1': '0.00'}],
 [7, {'f_2': '-0.36'}],
 [7, {'f_3': '1.44'}],
 [7, {'f_4': '-0.19'}],
 [7, {'f_5': '-0.00'}],
 [7, {'f_6': '0.06'}],
 [7, {'f_7': '-0.20'}],
 [7, {'f_8': '0.72'}],
 [7, {'f_9': '0.71'}]]

In [10]:
"""5 pts: Check that the new raw_features2_rdd and raw_target2_rdd RDDs are correct"""
# key is an integer
np.testing.assert_equal(type(raw_features2_rdd.map(map_features2).first()[0]), int)
# value is a dictionary
np.testing.assert_equal(type(raw_features2_rdd.map(map_features2).first()[1]), dict)

## Question 3:

You will create a function `map_target2` that will be applied to `raw_target2_rdd`. This function will create key-value pair where the key is the customer index as an integer and the value is the floating point representation of the target. Assign the resulting RDD to `raw_target3_rdd`. 

In [11]:
def map_target2(line):
    parts = line.split(';')
    parts[0] = parts[0].replace('customer(','')
    parts[0] = parts[0].replace(')','').strip()
    parts[1] = parts[1].replace('target(','')
    parts[1] = parts[1].replace(')','').strip()
    maped_rdd = [int(parts[0]),float(parts[1])]
    return maped_rdd
    raise NotImplementedError()

In [15]:
# make the assignment here
raw_target3_rdd = raw_target2_rdd.map(map_target2)

A sample of results:

```python
raw_target2_rdd.map(map_target2).take(5)
```

```console
[[40, -157.39],
 [41, -122.48],
 [42, -35.32],
 [43, -117.87],
 [44, 112.2],
 [45, -261.49],
 [46, -395.52],
 [47, 212.63],
 [48, -62.85],
 [49, -41.15]]
```

In [16]:
# try it yourself
raw_target2_rdd.map(map_target2).take(5)

[[70, 52.69], [71, -215.71], [72, 125.61], [74, -147.76], [75, 35.35]]

In [18]:
"""5 pts: Check that raw_target3_rdd contains the right values"""
# check types
np.testing.assert_equal(type(raw_target3_rdd.keys().first()), int)
np.testing.assert_equal(type(raw_target3_rdd.values().first()), float)
# the sum of all targets
np.testing.assert_approx_equal(raw_target3_rdd.values().sum(), -179351.71, significant=3)

## Question 4:

In this question, you will use map reduce to produce an RDD of key-value pairs where the key is the customer index and the value is a dictionairy with all the features and values associated with that customer. Notice that the map part of the map-reduce is already defined by `map_features2` on `raw_features2_rdd`. Therefore, define the proper `reduce_features2` function to produce the desired results. Create a RDD named `raw_features3_rdd` with the results:

In [19]:
def reduce_features2(v1, v2):
    val = v1.copy()
    val.update(v2)
    return val
    #raise NotImplementedError()
# Apply mapreduce to produce the raw_features3_rdd from raw_features2_rdd
# YOUR CODE HERE
#raise NotImplementedError()

In [20]:
raw_features3_rdd = raw_features2_rdd.map(map_features2).reduceByKey(reduce_features2)
raw_features3_rdd.first()

(5020,
 {'f_0': '0.60',
  'f_1': '0.07',
  'f_2': '0.21',
  'f_3': '-0.18',
  'f_4': '-0.34',
  'f_6': '0.20',
  'f_7': '-0.19',
  'f_8': '-0.86',
  'f_9': '-0.30'})

Running the map reduce should produce the following example result:
```python
raw_features3_rdd.take(2)
```

```console
[(0,
  {'f_0': -0.57,
   'f_1': -0.38,
   'f_2': 0.0,
   'f_3': -0.07,
   'f_4': -0.28,
   'f_5': -0.79,
   'f_7': 0.28,
   'f_8': 1.65,
   'f_9': 0.57}),
 (10,
  {'f_0': -0.89,
   'f_1': 0.3,
   'f_2': 0.62,
   'f_3': -0.21,
   'f_4': -1.02,
   'f_5': -0.28,
   'f_6': 0.54,
   'f_7': 1.83,
   'f_8': -0.35,
   'f_9': 0.55})]
```

In [22]:
"""10 pts: Check that raw_features3_rdd has the correct format and values. There could be hidden tests!"""
# key is an integer
np.testing.assert_equal(type(raw_features3_rdd.first()[0]), int)
# value is a dictionary
np.testing.assert_equal(type(raw_features3_rdd.first()[1]), dict)

## Question 5:

Use the different types of joins between `raw_target3_rdd` and `raw_features3_rdd` (`join`, `leftOuterJoin`, `rightOuterJoin`, or `fullOuterJoin`) combined with possible filters to create RDDs with elements of the form

`[customer_index, (target, feature_dict)]`

where `target` comes from `raw_target3_rdd`, and `feature_dict` is the dictionary with features from `raw_features3_rdd`.

Create the following variables containing RDDs:

- `rdd1`: data for customers who have all 10 features and a target
- `rdd2`: data for customers who have at least one feature and a target
- `rdd3`: data for customers who have all 10 features and may or may not have a target
- `rdd4`: data for customers who have at least one features and may or may not have a target
- `rdd5`: data for customers who may or may not have features and have a target

For example, for `rdd1`

```python
rdd1.sortByKey().take(2)
```
should return
```console
[(1,
  (36.67,
   {'f_0': 0.5,
    'f_1': 0.8,
    'f_2': -0.49,
    'f_3': 0.25,
    'f_4': 0.37,
    'f_5': 0.73,
    'f_6': -0.43,
    'f_7': 0.89,
    'f_8': -1.85,
    'f_9': -0.44})),
 (3,
  (-429.54,
   {'f_0': 0.14,
    'f_1': -0.87,
    'f_2': -0.94,
    'f_3': 0.09,
    'f_4': -0.69,
    'f_5': -0.29,
    'f_6': -0.45,
    'f_7': -0.6,
    'f_8': -1.28,
    'f_9': -0.38}))]
```

In [23]:
# create rdd1 here
# rdd1: data for customers who have all 10 features and a target
rdd1 = raw_features3_rdd.filter(lambda x: len(x[1]) == 10).join(raw_target3_rdd.filter(lambda x : len(x) == 2))
# raise NotImplementedError()

In [24]:
"""2 pts: Test if `rdd1` has the right data. Remember that there could be hidden tests!"""
# number of elements expected
np.testing.assert_equal(rdd1.count(), 5379)

In [25]:
# create rdd2 here
# rdd2: data for customers who have at least one feature and a target
rdd2 = raw_features3_rdd.filter(lambda x: len(x[1]) > 1).join(raw_target3_rdd.filter(lambda x : len(x) == 2))
#raise NotImplementedError()

In [26]:
"""2 pts: Test if `rdd2` has the right data. Remember that there could be hidden tests!"""
# number of elements expected
np.testing.assert_equal(rdd2.count(), 8968)

In [27]:
# create rdd3 below
# rdd3: data for customers who have all 10 features and may or may not have a target
rdd3 = raw_features3_rdd.filter(lambda x: len(x[1]) == 10).leftOuterJoin(raw_target3_rdd.filter(lambda x : len(x) == 1))
#raise NotImplementedError()

In [28]:
"""2 pts: Test if `rdd3` has the right data. Remember that there could be hidden tests!"""
# number of elements expected
np.testing.assert_equal(rdd3.count(), 5984)

In [29]:
# create rdd4 here
# rdd4: data for customers who have at least one features and may or may not have a target
rdd4 = raw_features3_rdd.filter(lambda x: len(x[1]) > 1).leftOuterJoin(raw_target3_rdd.filter(lambda x : len(x) == 1))
#raise NotImplementedError()

In [30]:
"""2 pts: Test if `rdd4` has the right data. Remember that there could be hidden tests!"""
# number of elements expected
np.testing.assert_equal(rdd4.count(), 10000)

In [31]:
# create rdd5 here
# rdd5: data for customers who may or may not have features and have a target
rdd5 = raw_features3_rdd.filter(lambda x: len(x[1]) > 1).rightOuterJoin(raw_target3_rdd.filter(lambda x : len(x) == 2))
#raise NotImplementedError()

In [32]:
"""2 pts: Test if `rdd5` has the right data. Remember that there could be hidden tests!"""
# number of elements expected
np.testing.assert_equal(rdd5.count(), 8968)
# elements with no targets
np.testing.assert_equal(rdd5.values().filter(lambda x: x[0] is None).count(), 0)


## Question 6:

The preprocessed Parquet file for customers who had only ten features and a target present is available in `/datasets/feats_target.parquet`. We will load it into a variable `df`

In [33]:
df = spark.read.parquet('/datasets/feats_target.parquet')

Explore the dataframe a bit:

```python
df.show(5)
```

```console
+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-------+
|customer_index|  f_0|  f_1|  f_2|  f_3|  f_4|  f_5|  f_6|  f_7|  f_8|  f_9| target|
+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-------+
|          5135|-0.34|-0.66|-0.11|-0.07|-0.14| 0.03|-0.71|-0.18| -0.4|  0.2|-271.22|
|          5235| 0.58|-0.69| 0.59| 0.69|  0.4| 0.65|-0.59|-0.78|-0.56|-0.14|-103.82|
|          5335|-0.13|  0.1|-0.13| 0.66|-1.06| -0.9|-0.26| 0.65|-0.08| -0.4|-287.26|
|          5435|-0.45| 0.15| 0.39|-0.65| -0.4| 0.38|-0.25| 0.57| 0.33|-0.01|-114.26|
|          5635|-0.85|-0.22|-0.51|  0.4| 0.31|-0.79| 0.17| 0.91| -0.6|-0.12|  44.81|
+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-------+
only showing top 5 rows
```

In [34]:
# explore it yourself
df.show(5)

+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-------+
|customer_index|  f_0|  f_1|  f_2|  f_3|  f_4|  f_5|  f_6|  f_7|  f_8|  f_9| target|
+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-------+
|          5135|-0.34|-0.66|-0.11|-0.07|-0.14| 0.03|-0.71|-0.18| -0.4|  0.2|-271.22|
|          5235| 0.58|-0.69| 0.59| 0.69|  0.4| 0.65|-0.59|-0.78|-0.56|-0.14|-103.82|
|          5335|-0.13|  0.1|-0.13| 0.66|-1.06| -0.9|-0.26| 0.65|-0.08| -0.4|-287.26|
|          5435|-0.45| 0.15| 0.39|-0.65| -0.4| 0.38|-0.25| 0.57| 0.33|-0.01|-114.26|
|          5635|-0.85|-0.22|-0.51|  0.4| 0.31|-0.79| 0.17| 0.91| -0.6|-0.12|  44.81|
+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-------+
only showing top 5 rows



The subpackage `pyspark.ml.functions` (aliased as `fn` below) contains many common functions for data analysis. Find the function for computing the __correlation__ (the strength of the linear relationship between two variables) between two columns, the function for computing __absolute__ values, and create a data frame `correlations_df` that contains the following columns in the following order:

1. `c0_target`: correlation between feature 0 and target
1. `c1_target`: correlation between feature 1 and target
1. `c2_target`: correlation between feature 2 and target
1. `c3_target`: correlation between feature 3 and target
1. `c4_target`: correlation between feature 4 and target
1. `c5_target`: correlation between feature 5 and target
1. `c6_target`: correlation between feature 6 and target
1. `c7_target`: correlation between feature 7 and target
1. `c8_target`: correlation between feature 8 and target
1. `c9_target`: correlation between feature 9 and target
1. `sig0`: boolean `true` if the absolute value of the correlation between feature 0 and target is greater than 0.5, `false` o.w.
1. `sig1`: boolean `true` if the absolute value of the correlation between feature 1 and target is greater than 0.5, `false` o.w.
1. `sig2`: boolean `true` if the absolute value of the correlation between feature 2 and target is greater than 0.5, `false` o.w.
1. `sig3`: boolean `true` if the absolute value of the correlation between feature 3 and target is greater than 0.5, `false` o.w.
1. `sig4`: boolean `true` if the absolute value of the correlation between feature 4 and target is greater than 0.5, `false` o.w.
1. `sig5`: boolean `true` if the absolute value of the correlation between feature 5 and target is greater than 0.5, `false` o.w.
1. `sig6`: boolean `true` if the absolute value of the correlation between feature 6 and target is greater than 0.5, `false` o.w.
1. `sig7`: boolean `true` if the absolute value of the correlation between feature 7 and target is greater than 0.5, `false` o.w.
1. `sig8`: boolean `true` if the absolute value of the correlation between feature 8 and target is greater than 0.5, `false` o.w.
1. `sig9`: boolean `true` if the absolute value of the correlation between feature 9 and target is greater than 0.5, `false` o.w.

**Hint: Remember that you can pass a list of columns to `df.select`. You can create such list with list comprehension, saving a lot of code**

In [35]:
# import the package functions as fn
from pyspark.sql import functions as fn

In [36]:
# apply some function to the columns: df.select(...)
df.select(1 + fn.col('target')).show(10)
f_Cols= df.columns[1:11]

+------------+
|(target + 1)|
+------------+
|     -270.22|
|     -102.82|
|     -286.26|
|     -113.26|
|       45.81|
|      -41.24|
|      250.34|
|      313.92|
|      267.92|
|      -18.14|
+------------+
only showing top 10 rows



In [39]:
# Create the dataframe `correlations_df` here
cor_df = df.select([fn.corr(x,'target').alias('c'+ x.split('_')[1]+'_target') for x in f_Cols])
sig_df = df.select([fn.when(fn.abs(fn.corr(x,'target'))>0.5, True).otherwise(False).alias('sig'+ x.split('_')[1]) for x in f_Cols])
cor_df= cor_df.withColumn("join_id", fn.monotonically_increasing_id())
sig_df= sig_df.withColumn("join_id", fn.monotonically_increasing_id())
correlations_df= cor_df.join(sig_df,"join_id","outer").drop("join_id")
correlations_df.show()
#raise NotImplementedError()

+--------------------+------------------+-------------------+--------------------+------------------+--------------------+------------------+-------------------+--------------------+--------------------+-----+----+-----+-----+----+-----+----+-----+-----+-----+
|           c0_target|         c1_target|          c2_target|           c3_target|         c4_target|           c5_target|         c6_target|          c7_target|           c8_target|           c9_target| sig0|sig1| sig2| sig3|sig4| sig5|sig6| sig7| sig8| sig9|
+--------------------+------------------+-------------------+--------------------+------------------+--------------------+------------------+-------------------+--------------------+--------------------+-----+----+-----+-----+----+-----+----+-----+-----+-----+
|0.004550467203235565|0.5175418076531231|0.24222549922181658|-0.02725292295686...|0.6109343603342663|-0.01506188705709...|0.5763480950714736|0.06334859042481801|0.007563889344988309|-0.01715842226040...|false|true|fal

In [40]:
"""10 pts: Check that the dataframe has the correct columns and values. There could be hidden tests!"""
# check column names
column_names = ['c' + str(fi) + '_target' for fi in range(10)] + \
               ['sig' + str(fi) for fi in range(10)]

# column's names and positions in the right order
np.testing.assert_equal(correlations_df.columns, column_names)

