# <span style="color:blue">Programming for Data Science - DS-GA 1007</span>
## <span style="color:blue">Homework 8: NumPy and pandas</span>

### Due Date: Sunday 11/17, 11:59 PM

We will explore some aspects of the NumPy and pandas packages. Along the way, we will get experience with manipulating arrays and tables. These skills are helpful throughout data science, particularly in data collection and data processing. By completing Homework 8, you will:

- Practice accessing entries of arrays through logical expressions

- Gain experience with combining Series, filtering DataFrames and reading/writing tabular data from different file formats

### Submission Instructions
For this assignment and future assignments (Homework 9, 10, 11), you will submit a copy to Gradescope. Follow these steps:

1. Download as HTML (`File->Download As->HTML(.html)`). 
1. Open the HTML in the browser. Print to .pdf 
1. Upload to Gradescope. Tag your answers. 

Note that 

- Please map your answers to our questions. Otherwise you may lose points. Please see the rubric below. 
- You should break long lines of code into multiple lines. Otherwise your code will extend out of view from the cell. Consider using `\` followed by a new line. 
- For each textual response, please include relevant code that informed your response. For each plotting question, please include the code used to generate the plot.
- You should not display large output cells, such as all rows of a table. Instead, convert the input cell from Code to Markdown and back to Code to remove the output cell.


### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about
the homework, we ask that you **write your solutions individually**. If you do
discuss the assignments with others please **include their names** at the top
of your solution.


### Rubric

Question | Points
--- | ---
Gradescope | 2
Question 1 | 1
Question 2 | 1
Question 3 | 2
Question 4 | 2
Question 5 | 2
Question 6 | 2
Question 7 | 2
Question 8 | 2
Total | 16

Please import the following packages:

In [4]:
import numpy as np
from numpy.testing import assert_array_equal
import pandas as pd
from countwords import load_word_counts

### Exercise 1 (1.0 point)
Using NumPy, complete the function below. The function should create and return the following 2D array:
```
[[ 1  6 11 16 21 26 31 36]
 [ 2  7 12 17 22 27 32 37]
 [ 3  8 13 18 23 28 33 38]
 [ 4  9 14 19 24 29 34 39]
 [ 5 10 15 20 25 30 35 40]]
```
You must find a way to generate the array without constructing it explicitly.

In [5]:
def create_array():
    ### BEGIN SOLUTION
    
    result = np.arange(1,41).reshape(8,5).T
    
    ### END SOLUTION
    return result

In [6]:
create_array()

array([[ 1,  6, 11, 16, 21, 26, 31, 36],
       [ 2,  7, 12, 17, 22, 27, 32, 37],
       [ 3,  8, 13, 18, 23, 28, 33, 38],
       [ 4,  9, 14, 19, 24, 29, 34, 39],
       [ 5, 10, 15, 20, 25, 30, 35, 40]])

In [7]:
# if your code works properly, the assert below should not raise a message
assert_array_equal(create_array(), np.array([[ 1,  6, 11, 16, 21, 26, 31, 36],
                                             [ 2,  7, 12, 17, 22, 27, 32, 37],
                                             [ 3,  8, 13, 18, 23, 28, 33, 38],
                                             [ 4,  9, 14, 19, 24, 29, 34, 39],
                                             [ 5, 10, 15, 20, 25, 30, 35, 40]]))
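
The same layout can probably be produced without a transpose. The sketch below assumes the `order='F'` (column-major) option of `reshape`; the function name `create_array_fortran` is hypothetical and not part of the assignment.

```python
import numpy as np

def create_array_fortran():
    # Hypothetical alternative: fill a 5x8 array column by column
    # (Fortran / column-major order) instead of reshaping and transposing.
    return np.arange(1, 41).reshape((5, 8), order='F')

# Compare against the reshape-then-transpose construction.
transposed = np.arange(1, 41).reshape(8, 5).T
same = np.array_equal(create_array_fortran(), transposed)
print(same)
```

Either construction avoids writing the array out entry by entry, which is what the exercise asks for.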

### Exercise 2 (1.0 point)
The function below should return an array containing the second, fourth and fifth rows from the input array `original`. The argument (input array) must be a 2D array. If the argument is not a 2D array or does not have the required number of rows, return `None`. 

In [8]:
def new_array_second_fourth_fifth(original):
    ### BEGIN SOLUTION
    
    if type(original) is not np.ndarray: # check if array
        result = None
    elif len(original.shape) != 2: # check if 2D array
        result = None
    elif original.shape[0] < 5: # check that at least 5 rows exist
        result = None
    else:
        indices = np.array([1,3,4]) # zero-indexed rows wanted
        result = original[indices,0:]
    
    ### END SOLUTION
    return result

In [9]:
# test
new_array_second_fourth_fifth(create_array())

array([[ 2,  7, 12, 17, 22, 27, 32, 37],
       [ 4,  9, 14, 19, 24, 29, 34, 39],
       [ 5, 10, 15, 20, 25, 30, 35, 40]])

In [10]:
# if your code works properly, the asserts below should not raise a message
assert_array_equal(new_array_second_fourth_fifth([1,2,3]), None)
assert_array_equal(new_array_second_fourth_fifth(create_array()), 
                   np.array([
                       [ 2,  7, 12, 17, 22, 27, 32, 37],
                       [ 4,  9, 14, 19, 24, 29, 34, 39],
                       [ 5, 10, 15, 20, 25, 30, 35, 40]]))
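
Selecting rows with a list of integer positions is sometimes called fancy indexing. A minimal sketch, using a small array built only for illustration, of how the dimension check and the row selection behave:

```python
import numpy as np

a = np.arange(1, 41).reshape(8, 5).T   # 5 rows, 8 columns

# ndim reports the number of dimensions; shape[0] is the number of rows.
is_2d_with_five_rows = (a.ndim == 2) and (a.shape[0] >= 5)

# Indexing with a list of row positions returns a new array (a copy),
# so changing the result does not change the original.
picked = a[[1, 3, 4]]
picked[0, 0] = -1

print(is_2d_with_five_rows, picked.shape, a[1, 0])
```

Because the result is a copy, the function in this exercise can return it without any risk of the caller modifying `original` through it.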

### Exercise 3 (2.0 point) 

### 3a (1.0 point)

NumPy provides a function called `where`, which finds the elements of an array that satisfy a condition. 

Using the function `where`, complete the function below such that it returns a flattened array with the elements from the input array `original` that are multiples of three. Return `None` if there is no multiple of three. 

The input must be a NumPy array (`np.ndarray`). If the argument is invalid, return `None`.

In [11]:
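def multi_three(original):
    ### BEGIN SOLUTION
    if type(original) is not np.ndarray: # check if array
        result = None
    else:
        mult_three = np.where(original % 3 == 0) # indices of the multiples
        result = original[mult_three]
        if result.size == 0: # no multiple of three found
            result = None
    ### END SOLUTION
    return result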

In [12]:
multi_three(create_array())

array([ 6, 21, 36, 12, 27,  3, 18, 33,  9, 24, 39, 15, 30])

In [13]:
# if your code works properly, the asserts below should not raise a message
assert_array_equal(multi_three(create_array()), 
                   np.array([ 6, 21, 36, 12, 27,  3, 18, 33,  9, 24, 39, 15, 30]))
assert_array_equal(multi_three(np.array([[1,2,15],[1,2,3]])),np.array([15,3]))
assert_array_equal(multi_three(np.array([[3,7,9,2,2,15]])), np.array([3,9,15]))
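
With a single argument, `np.where` returns a tuple of index arrays, one per dimension. A short sketch (the array is toy data) showing that indexing with that tuple selects the same elements as indexing with the boolean mask directly:

```python
import numpy as np

a = np.array([[1, 2, 15],
              [1, 2, 3]])
mask = a % 3 == 0

# One index array per dimension: row positions, then column positions.
rows, cols = np.where(mask)

via_where = a[rows, cols]   # same as a[np.where(mask)]
via_mask = a[mask]          # boolean indexing, row-major order

print(rows, cols, via_where, via_mask)
```

Both forms return a flattened 1D array in row-major order, which is why the expected outputs above list the multiples row by row.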

### 3b (1.0 point)

Write a custom implementation of `np.where`, without using `np.where`. This should match the signature of `np.where` exactly. For simplicity, your `custom_where` method only needs to work on 1-D and 2-D arrays.

In [26]:
# ls = [(0,0),(1,1),(3,0),(4,0)]
# ls_1 = []
# ls_2 = []
# for i,j in ls:
#     ls_1.append(i)
#     ls_2.append(j)

# [i[0] for i in ls]
# [i[1] for i in ls]

def find_true_indices(temp):
    # collect the index tuple of every True entry, in row-major order
    indices_to_keep = []
    for index, value in np.ndenumerate(temp):
        if value:
            indices_to_keep.append(index)
    return indices_to_keep

def converter(some_list):
    # convert a list of indices to an integer array, as np.where returns
    some_array = np.asarray(some_list)
    return some_array.astype('int64')

def custom_where(boolean_arr):
    list_of_tuples = find_true_indices(boolean_arr)
    # use ndim rather than len(list_of_tuples[0]), which fails when no entry is True
    number_of_indices = boolean_arr.ndim
    tuple_of_lists = ([idx[entry] for idx in list_of_tuples] for entry in range(number_of_indices))
    output = tuple(map(converter, tuple_of_lists))
    return output

In [25]:
def custom_where(boolean_arr):
    ### BEGIN SOLUTION
    
    if len(boolean_arr.shape)==2:
        arr_1 = [] # row indices of True entries
        arr_2 = [] # column indices of True entries
        for i in range(boolean_arr.shape[0]): 
            for j in range(boolean_arr.shape[1]):
                if boolean_arr[i][j]:
                    arr_1.append(i)
                    arr_2.append(j)
        # return integer arrays, not lists, to match np.where
        output = (np.array(arr_1, dtype=np.intp), np.array(arr_2, dtype=np.intp))
    
    else:
        arr_1 = []
        for i in range(boolean_arr.shape[0]): 
            if boolean_arr[i]:
                arr_1.append(i)
        output = (np.array(arr_1, dtype=np.intp),)
        
    ### END SOLUTION
    return output

In [27]:
custom_where(create_array() % 4 == 0)
#np.where(np.array([True, False, True, True, False]))

(array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4]), array([3, 7, 2, 6, 1, 5, 0, 4, 3, 7]))

In [28]:
# if your code works properly, the asserts below should not raise a message
assert_array_equal(custom_where(create_array() % 4 == 0),
                   np.where(create_array() % 4 == 0))
assert_array_equal(custom_where(np.array([True, False, True, True, False])),
                   np.where(np.array([True, False, True, True, False])))
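
For cross-checking a custom implementation (not as a substitute for writing one), the one-argument form of `np.where` should agree with `np.nonzero` and with the transposed output of `np.argwhere`. A small sketch on a toy boolean array:

```python
import numpy as np

cond = np.array([[True, False],
                 [False, True],
                 [True, True]])

by_where = np.where(cond)
by_nonzero = np.nonzero(cond)

# argwhere returns one (row, col) pair per True entry; transposing
# regroups them into one array per dimension.
by_argwhere = tuple(np.argwhere(cond).T)

agree = all(np.array_equal(w, n) and np.array_equal(w, g)
            for w, n, g in zip(by_where, by_nonzero, by_argwhere))
print(agree)
```

The loop-based solutions above build the same per-dimension arrays by visiting entries in row-major order.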

### Exercise 4 (2.0 point) 

### 4a (0.5 point)

NumPy provides a function called `logical_and`. Using `logical_and`, complete the function below such that it returns an array with the elements from the input array that are in the interval between 3 and 11, inclusive. If there is no element in the sought interval the function should return `None`. The argument must be a NumPy array. If the argument is invalid, return `None`.

(You should not have to use any loops for **4a**.)

In [14]:
def get_interval_descending(original):
    ### BEGIN SOLUTION
    if type(original) is not np.ndarray: # check if array
        result = None
    else: 
        result = original[np.logical_and(original>=3, original<=11)]
        if result.size == 0: # no element in the interval
            result = None
    ### END SOLUTION
    return result

In [15]:
# if your code works properly, the asserts below should not raise a message
assert_array_equal(get_interval_descending(create_array()), np.array([6, 11,  7,  3,  8,  4,  9,  5, 10]))
assert_array_equal(get_interval_descending(np.array([1, 2, 12, 13])), None)
assert_array_equal(get_interval_descending(np.array([[1, 2],[11, 13]])), np.array([11]))

### 4b (0.5 point)

Write `get_interval_descending_v2` which does the same as the above, except using the logical operator `&` instead of `np.logical_and`.

In [16]:
def get_interval_descending_v2(original):
    ### BEGIN SOLUTION
    if type(original) is not np.ndarray: # check if array
        result = None
    else: 
        result = original[(original >= 3) & (original <=11)]
        if result.size == 0: # no element in the interval
            result = None
        
    ### END SOLUTION
    return result

In [17]:
# if your code works properly, the asserts below should not raise a message
assert_array_equal(get_interval_descending_v2(create_array()),
                   np.array([6, 11,  7,  3,  8,  4,  9,  5, 10]))
assert_array_equal(get_interval_descending_v2(np.array([1, 2, 12, 13])), None)
assert_array_equal(get_interval_descending_v2(np.array([[1, 2],[11, 13]])), np.array([11]))
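
The parentheses around each comparison matter when using `&`, because `&` binds more tightly than `>=` and `<=` in Python. A minimal sketch of what happens with and without them, on toy data:

```python
import numpy as np

a = np.array([1, 2, 11, 13])

# Parenthesized comparisons: each yields a boolean array, then & combines them.
in_range = a[(a >= 3) & (a <= 11)]

# Without parentheses the expression is read as the chained comparison
# a >= (3 & a) <= 11, which needs the truth value of a whole array
# and therefore raises ValueError.
try:
    a[a >= 3 & a <= 11]
    raised = False
except ValueError:
    raised = True

print(in_range, raised)
```

`np.logical_and` takes the two comparisons as separate arguments, so it does not have this precedence pitfall.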

### 4c (1.0 point)

Write a function `logical_and_many` that takes a list of boolean arrays and computes the logical and across all of them. You can use either `np.logical_and` or the `&` operator in this case.

In [18]:
def logical_and_many(boolean_arr_ls):
    ### BEGIN SOLUTION
    
    # apply np.logical_and elementwise across all arrays in the list
    result = np.logical_and.reduce(boolean_arr_ls)
    
    ### END SOLUTION
    return result

In [19]:
# if your code works properly, the asserts below should not raise a message
arr = np.arange(20)
assert_array_equal(arr[logical_and_many([arr%2==0, arr>7, arr!=14])],
                   np.array([ 8, 10, 12, 16, 18]))
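
Another way to combine many masks is to fold `np.logical_and` over the list. The sketch below uses `functools.reduce` and assumes the list is non-empty; the name `logical_and_many_reduce` is hypothetical.

```python
from functools import reduce

import numpy as np

def logical_and_many_reduce(masks):
    # Hypothetical variant: combine the masks pairwise, left to right.
    # Assumes masks is non-empty; reduce raises TypeError on an empty list.
    return reduce(np.logical_and, masks)

arr = np.arange(20)
selected = arr[logical_and_many_reduce([arr % 2 == 0, arr > 7, arr != 14])]
print(selected)
```

Swapping `np.logical_and` for `operator.and_` would give the `&` version of the same idea.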

### Exercise 5 (2.0 point) 

The `one_selection` function takes as input a 2D array `original` and returns a 1D array that contains the element closest to $1$ for each row. Please do not use loops or comprehensions.

Hint: Use the NumPy function `argmin`.

In [20]:
import numpy as np

def one_selection(original):
    ### BEGIN SOLUTION
    # use the argument, not the global v: column of the entry closest to 1 in each row
    cols = (np.abs(original - 1)).argmin(axis=1)
    rows = np.arange(original.shape[0])
    result = original[rows,cols]
    ### END SOLUTION
    return(result)

In [21]:
v = np.array([[  0,  -7,  -4],
              [-10,   5,  -1],
              [  6,   3,  -4],
              [  4,   6,   2],
              [ -8,  -4,   6]])
one_selection(v)

array([ 0, -1,  3,  2, -4])

In [22]:
# if your code works properly, the asserts below should not raise a message
from numpy.testing import assert_array_equal
v = np.array([[  0,  -7,  -4],
              [-10,   5,  -1],
              [  6,   3,  -4],
              [  4,   6,   2],
              [ -8,  -4,   6]])
assert_array_equal(one_selection(v), np.array([0, -1,  3,  2, -4]))
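
As a possible alternative to pairing row and column indices by hand, `np.take_along_axis` can pick one entry per row from the `argmin` result. This is a sketch under that assumption; `one_selection_take` is a hypothetical name.

```python
import numpy as np

def one_selection_take(original):
    # Column position of the entry closest to 1 in each row
    # (ties go to the first such column, as argmin does).
    cols = np.abs(original - 1).argmin(axis=1)
    # take_along_axis expects indices with the same number of dimensions,
    # so add a trailing axis and flatten the (n_rows, 1) result.
    return np.take_along_axis(original, cols[:, np.newaxis], axis=1).ravel()

v = np.array([[  0,  -7,  -4],
              [-10,   5,  -1],
              [  6,   3,  -4],
              [  4,   6,   2],
              [ -8,  -4,   6]])
closest = one_selection_take(v)
print(closest)
```

In the last row, `-4` and `6` are equally far from 1; the first one is returned.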

### Exercise 6 (2.0 point) 
Consider the DataFrame `df` as built in the code below. Using pandas, write a function that receives the given DataFrame as input and returns a new DataFrame containing only columns $1$, $3$ and 'Animal' for the rows whose value in column 'Animal' is _Dino_. You should be able to do this using only pandas indexing.

For instance, if this is the input dataframe:
```
     0	1	2	3	4	Animal
-------------------------------
0	8	4	8	7	8	Ptero
1	4	2	7	8	7	Ptero
2	9	9	3	4	4	Dino
3	5	4	3	2	6	Ptero
4	6	8	9	6	7	Dino
5	3	2	9	7	3	Ptero
6	9	2	2	5	7	Ptero
7	1	2	9	2	8	Croco
8	4	1	0	3	7	Croco
9	0	8	2	9	4	
```
the function should return:
```
     1	3	Animal
-------------------
2	9	4	Dino
4	8	6	Dino
```

In [23]:
def select_dino(df_input):
    ### BEGIN SOLUTION
    result = df_input[df_input['Animal'] == 'Dino']
    result = result[[1,3,'Animal']]
    ### END SOLUTION
    return(result)

In [24]:
m = np.random.randint(low=0,high=10,size=((10,5)))
df = pd.DataFrame(data=m)
animal_names = ['Croco','Dino','Anacon','Ptero']
animals = np.random.choice(animal_names, 10, p=[0.3, 0.2, 0.2, 0.3])
df['Animal'] = animals
#print(df)
select_dino(df)

| | 1 | 3 | Animal |
| --- | --- | --- | --- |
| 4 | 8 | 1 | Dino |
| 6 | 6 | 2 | Dino |
| 7 | 1 | 8 | Dino |


In [25]:
# if your code works properly, the asserts below should not raise a message
m = np.random.randint(low=0,high=10,size=((10,5)))
df = pd.DataFrame(data=m)
animal_names = ['Croco','Dino','Anacon','Ptero']
animals = np.random.choice(animal_names, 10, p=[0.3, 0.2, 0.2, 0.3])
df['Animal'] = animals

dinos_ids = np.where(animals == 'Dino')
assert_array_equal(select_dino(df)[[1,3]].values, m[dinos_ids][:,[1,3]])
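
The row filter and the column selection can also be written as a single `.loc` call. A minimal sketch on a three-row toy DataFrame (values taken from the first rows of the example table above):

```python
import pandas as pd

df_example = pd.DataFrame({
    0: [8, 4, 9],
    1: [4, 2, 9],
    2: [8, 7, 3],
    3: [7, 8, 4],
    'Animal': ['Ptero', 'Ptero', 'Dino'],
})

# .loc[row_mask, column_labels]: boolean mask for rows, label list for columns.
dinos = df_example.loc[df_example['Animal'] == 'Dino', [1, 3, 'Animal']]
print(dinos)
```

Doing both selections in one `.loc` call avoids chained indexing, which tends to matter mostly when assigning to the result.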

### Exercise 7 (2.0 point) 
The pandas module provides a function called `concat` that can be used to concatenate two pandas Series. Using `concat`, complete the function below such that it receives two Series S1 and S2 as input and returns a new Series containing the most frequent value of S1, followed by the most frequent value of S2, each repeated as many times as it occurs in its own Series. The index of the returned Series should match the positions where those values appear in S1 and S2.

For instance, given the series:
```
  S1         S2
------     ------
0    1     0    7
1    6     1    9
2    6     2    2
3    8     3    1
4    2     4    9
5    6     5    0
6    6     6    8
```
the function should return:
```
result
------
1    6
2    6
5    6
6    6
1    9
4    9
```

Hint: Use the `value_counts` method to determine the most common numbers from each Series

In [26]:
def concat_frequents(s1,s2):
    ### BEGIN SOLUTION
    
    # value_counts sorts by count, descending, so the first index is the most frequent value
    s1_max = s1.value_counts().index[0]
    s1_input = s1[s1 == s1_max]
    
    s2_max = s2.value_counts().index[0]
    s2_input = s2[s2 == s2_max]
    
    result = pd.concat([s1_input,s2_input])

    ### END SOLUTION
    return(result)

In [27]:
s1 = pd.Series([1,6,6,8,2,6,6])
s1_max = s1.value_counts().index[0]
s1[s1 == s1_max]

s2 = pd.Series([7,9,2,1,9,0,8])
s2_max = s2.value_counts().index[0]
s2[s2 == s2_max]


concat_frequents(s1,s2)

1    6
2    6
5    6
6    6
1    9
4    9
dtype: int64

In [28]:
# if your code works properly, the asserts below should not raise a message

s1 = pd.Series([1,6,6,8,2,6,6])
s2 = pd.Series([7,9,2,1,9,0,8])

s = concat_frequents(s1,s2)

assert_array_equal(list(s.values),[6, 6, 6, 6, 9, 9])
assert_array_equal(list(s.index),[1, 2, 5, 6, 1, 4])
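
There are a few ways to read the most frequent value off a Series. The sketch below compares three of them on the `S1` example; how ties between equally frequent values are ordered is not covered here and should be checked separately.

```python
import pandas as pd

s1 = pd.Series([1, 6, 6, 8, 2, 6, 6])

counts = s1.value_counts()        # counts sorted in descending order
top_by_position = counts.index[0] # first label after sorting
top_by_idxmax = counts.idxmax()   # label of the largest count
modes = s1.mode().tolist()        # all most frequent values, sorted

print(top_by_position, top_by_idxmax, modes)
```

`mode` returns every tied value, so it may be the clearer choice when ties are possible.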

### Exercise 8 (2.0 point)

We want to process the data from Homework 7 that counted the occurrence of words. Here we will use pandas to convert the data to a DataFrame, format the columns, filter the rows, and output to a .tsv file. 

$1.$ Use `load_word_counts` to read the data in `isles.dat` into a list called `word_count_frequency`. Each entry of `word_count_frequency` should be a tuple.

In [29]:
### BEGIN SOLUTION 

file_name = 'isles.dat'
word_count_frequency = load_word_counts(file_name)

### END SOLUTION

$2$. Create a DataFrame called `df_words` containing the entries of `word_count_frequency` as rows. The DataFrame should have three columns called `Word`, `Count`, `Frequency`. 

In [30]:
### BEGIN SOLUTION 

df_words = pd.DataFrame(word_count_frequency, columns=['Word','Count','Frequency'])
df_words.head()

### END SOLUTION 

| | Word | Count | Frequency |
| --- | --- | --- | --- |
| 0 | the | 3822 | 6.737176 |
| 1 | of | 2460 | 4.33633 |
| 2 | and | 1723 | 3.037194 |
| 3 | to | 1479 | 2.607086 |
| 4 | a | 1308 | 2.305658 |


$3.$ Create another DataFrame called `df_words_filtered` dropping all rows containing words with fewer than 30 occurrences.

In [31]:
### BEGIN SOLUTION 

df_words_filtered = df_words[df_words['Count']>=30]
df_words_filtered.tail()
### END SOLUTION

| | Word | Count | Frequency |
| --- | --- | --- | --- |
| 219 | give | 30 | 0.052882 |
| 220 | name | 30 | 0.052882 |
| 221 | good | 30 | 0.052882 |
| 222 | went | 30 | 0.052882 |
| 223 | mull | 30 | 0.052882 |


$4.$ Round the numbers in the `Frequency` column of `df_words_filtered` to two decimal places. 

Your solution should not raise a `SettingWithCopyWarning`. If it does, first call:

```python
df_words_filtered = df_words_filtered.copy()
```

In [32]:
### BEGIN SOLUTION 
df_words_filtered = df_words_filtered.copy()
df_words_filtered['Frequency'] = df_words_filtered['Frequency'].round(2)
df_words_filtered.tail()
### END SOLUTION 

| | Word | Count | Frequency |
| --- | --- | --- | --- |
| 219 | give | 30 | 0.05 |
| 220 | name | 30 | 0.05 |
| 221 | good | 30 | 0.05 |
| 222 | went | 30 | 0.05 |
| 223 | mull | 30 | 0.05 |
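
The test at the end of this exercise expects `6.74` for the frequency `6.737176`, which corresponds to rounding rather than strict truncation. A small sketch of the difference on two of the frequencies shown above; the truncation via `np.trunc` is one possible approach, not the one used in the solution.

```python
import numpy as np
import pandas as pd

freq = pd.Series([6.737176, 0.052882])

# Rounding to two decimals: 6.737176 -> 6.74
rounded = freq.round(2)

# Strict truncation: scale, drop the fractional part, scale back: 6.737176 -> 6.73
truncated = np.trunc(freq * 100) / 100

print(rounded.tolist(), truncated.tolist())
```

The two agree for `0.052882` but differ for `6.737176`, so only rounding matches the expected line in the output file.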


$5.$ Output `df_words_filtered` to `word_counts.tsv`. Ensure that the separator between columns is a tab. Do not include the index as an additional column.

In [33]:
### BEGIN SOLUTION 
df_words_filtered.to_csv('word_counts.tsv', sep='\t', index=False)
### END SOLUTION 

In [34]:
### TESTS
with open('word_counts.tsv', "r") as f:
    line = f.readlines()[1]
assert line == 'the\t3822\t6.74\n'
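
To see what the tab-separated output looks like and to check that it reads back cleanly, the round trip can be sketched in memory. The two-row DataFrame is toy data, and `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

df_small = pd.DataFrame({'Word': ['the', 'of'],
                         'Count': [3822, 2460],
                         'Frequency': [6.74, 4.34]})

# Write tab-separated text without the index column.
buffer = io.StringIO()
df_small.to_csv(buffer, sep='\t', index=False)
tsv_lines = buffer.getvalue().splitlines()

# Read it back, telling pandas that the separator is a tab.
df_back = pd.read_csv(io.StringIO(buffer.getvalue()), sep='\t')

print(tsv_lines[0].split('\t'), df_back.shape)
```

The first line holds the column names, so the first data row is the second line, which is why the test above reads `f.readlines()[1]`.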