# DSC 80: Lab 01

### Due Date: Tuesday April 07, Midnight (11:59 PM)

## Zoom Lab Hours
- Follow instructions on this link: https://docs.google.com/document/d/16qZpPSYhxwQDMcn-lGQjC-J-PzppLevv_mANLt2ko8g/edit 

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab01.py` file, which is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab01.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab**.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab**` merely import the existing compiled python.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab01 as lab

In [3]:
import os
import pandas as pd
import numpy as np

## Python Basics

---
**Question 0 (EXAMPLE):**

Write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two adjacent list elements that are consecutive integers.
* Otherwise, returns `False`.

For example, because `9` is next to `8`:
```
>>> lab.consecutive_ints([5,3,6,4,9,8])
True
```
Whereas:
```
>>> lab.consecutive_ints([1,3,5,7,9])
False
```

*Note*: This question is done for you, to demonstrate a completed homework problem.

In [4]:
def consecutive_ints(ints):
    """
    consecutive_ints tests whether a list contains two 
    adjacent elements that are consecutive integers.

    :param ints: a list of integers
    :returns: a boolean value if ints contains two 
    adjacent elements that are consecutive integers.

    :Example:
    >>> consecutive_ints([5,3,6,4,9,8])
    True
    >>> consecutive_ints([1,3,5,7,9])
    False
    """

    if len(ints) == 0:
        return False

    for k in range(len(ints) - 1):
        diff = abs(ints[k] - ints[k+1])
        if diff == 1:
            return True

    return False


In [5]:
# Add more cells if you'd like!

Test your code in two ways:
1. Run the cell below to test your code. You should also copy the cell and change the input to test further (i.e. write your own doctests)! Does it work for corner cases? Real-world data is **very messy** and you should expect your data processing code to break without thorough testing!
2. Run doctests on `lab01.py` by running the following command on the command line:
```
python -m doctest lab01.py
```
If the doctests pass, then there should be *no* output.
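
Doctests can also be run from inside Python, which is convenient while experimenting in the notebook. The following is a minimal sketch using the standard `doctest` module; the helper name `run_doctests` and the shortened docstring are illustrative assumptions, not part of the starter code.

```python
import doctest


def consecutive_ints(ints):
    """
    >>> consecutive_ints([5, 3, 6, 4, 9, 8])
    True
    >>> consecutive_ints([])
    False
    """
    # Compare each element with its right-hand neighbour.
    return any(abs(a - b) == 1 for a, b in zip(ints, ints[1:]))


def run_doctests(func):
    """Run the examples in func's docstring (hypothetical helper)."""
    parser = doctest.DocTestParser()
    # Make the function visible under its own name while the examples run.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # run() returns a TestResults tuple with `failed` and `attempted` counts.
    return runner.run(test)


results = run_doctests(consecutive_ints)
print(results.failed, results.attempted)
```

As with the command-line invocation, the runner prints nothing for passing examples; only the final `print` produces output here.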

In [6]:
# test your code!
lab.consecutive_ints([1,3,2,4])

True

In [7]:
lab.consecutive_ints([0])

False

In [8]:
lab.consecutive_ints([])

False

---
**Question 1 (median):**

Write a function called *median* that takes a non-empty list of numbers, returning the median element of the list. If the list has even length, it should return the mean of the two elements in the middle. Do not use any imported libraries for this question; you may use any built-in function.


In [9]:
def median(nums):
    """
    median takes a non-empty list of numbers,
    returning the median element of the list.
    If the list has even length, it should return
    the mean of the two elements in the middle.

    :param nums: a non-empty list of numbers.
    :returns: the median of the list.
    
    :Example:
    >>> median([6, 5, 4, 3, 2]) == 4
    True
    >>> median([50, 20, 15, 40]) == 30
    True
    >>> median([1, 2, 3, 4]) == 2.5
    True
    """
    
    # Sort a copy so that the caller's list is left unchanged.
    ordered = sorted(nums)
    size = len(ordered)
    mid = size // 2

    if size % 2 == 1:
        return ordered[mid]

    # Even length: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2

In [10]:
test1 = [6, 5, 4, 3, 2]

In [11]:
size = len(test1)
size
print(size//2)

2


In [12]:
median([1, 2, 3, 4]) == 2.5

True

In [13]:
# Try this
lab.median([0, -1, 1, 100])

Ellipsis

---
**Question 2 (List Distances):**

Similar to Question 0, write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two list elements $i$ places apart, whose distance as integers is also $i$.
* Otherwise, returns `False`.

Assume your inputs tend to satisfy the condition, and the pair(s) satisfying the condition tend to be close together; design your function to run faster for this case. (Optimizing your code for an assumed distribution of incoming data is very common in data science).

For example, because `3` and (the second) `5` are two places apart, and $|3-5| = 2$:
```
>>> lab.same_diff_ints([5,3,1,5,9,8])
True
```
Whereas:
```
>>> lab.same_diff_ints([1,3,5,7,9])
False
```

*Note*: Make sure to define some extreme test cases. Use the `%time` command to time your function!

In [99]:
def same_diff_ints(ints):
    """
    same_diff_ints tests whether a list contains
    two list elements i places apart, whose distance
    as integers is also i.

    :param ints: a list of integers
    :returns: a boolean value if ints contains two
    elements as described above.

    :Example:
    >>> same_diff_ints([5,3,1,5,9,8])
    True
    >>> same_diff_ints([1,3,5,7,9])
    False
    """
    size = len(ints)
    
    for i in range(0, size):
        for j in range(i + 1, size):
            diff_idx = abs(i - j)
            diff_vals = abs(ints[i] - ints[j])
            if diff_idx == diff_vals:
                return True
            
    return False

In [100]:
same_diff_ints([5,3,1,5,9,8])

True

In [101]:
same_diff_ints([1,3,5,7,9])

False

In [102]:
%time lab.same_diff_ints([5,3,1,5,9,8])

CPU times: user 12 µs, sys: 1 µs, total: 13 µs
Wall time: 41.2 µs


True
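
One way to favor inputs whose matching pairs are close together is to loop over the gap between positions first, so that all adjacent pairs are checked before any pair two places apart, and so on. The sketch below illustrates that ordering; the name `same_diff_ints_by_gap` is hypothetical and it is not the graded function.

```python
def same_diff_ints_by_gap(ints):
    """Check the closest pairs first (illustrative variant)."""
    size = len(ints)
    # gap is the number of places between the two elements.
    for gap in range(1, size):
        for i in range(size - gap):
            if abs(ints[i] - ints[i + gap]) == gap:
                return True
    return False


print(same_diff_ints_by_gap([5, 3, 1, 5, 9, 8]))
print(same_diff_ints_by_gap([1, 3, 5, 7, 9]))
```

The worst case is still quadratic in the length of the list, but a nearby pair is found after at most a few passes regardless of where it sits in the list.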

---
## Strings and Files

The following questions will help you (re)learn the basics of working with strings and reading data from files (which are read in as strings, by default).

---
**Question 3 (Prefixes):**

Write a function `prefixes` that takes a string and returns a string of every consecutive prefix of the input string. For example, `prefixes('Data!')` should return `'DDaDatDataData!'`.  (See the doctests for more examples).

Recall that [strings may be sliced](https://docs.python.org/3/tutorial/introduction.html#strings), like lists.


In [18]:
def prefixes(s):
    """
    prefixes returns a string of every 
    consecutive prefix of the input string.

    :param s: a string.
    :returns: a string of every consecutive prefix of s.

    :Example:
    >>> prefixes('Data!')
    'DDaDatDataData!'
    >>> prefixes('Marina')
    'MMaMarMariMarinMarina'
    >>> prefixes('aaron')
    'aaaaaraaroaaron'
    """

    # Concatenate the slices s[:1], s[:2], ..., s[:len(s)];
    # an empty input string yields an empty result.
    word = ''
    for i in range(1, len(s) + 1):
        word = word + s[:i]

    return word

In [19]:
prefixes('Data!') == 'DDaDatDataData!'


True

In [20]:
#prefixes('Marina') == 'MMaMarMariMarinMarina'
prefixes('aaron') == 'aaaaaraaroaaron'

True

In [21]:
text = 'Data!'
s = len(text)
word = text[0]
print(word)
for i in range(1, s):
    to_add = text[0:i+1]
    print(to_add)
    word = word + to_add
word

D
Da
Dat
Data
Data!


'DDaDatDataData!'

---
**Question 4 (Evens reversed):**

Write a function `evens_reversed` that takes in a non-negative integer $N$ and returns a string containing all even integers from $1$ to $N$ (inclusive) in reversed order, separated by spaces. Additionally, [zero pad](https://www.tutorialspoint.com/python/string_zfill.htm) each integer, so that each has the same length.

In [22]:
def evens_reversed(N):
    """
    evens_reversed returns a string containing 
    all even integers from  1  to  N  (inclusive)
    in reversed order, separated by spaces. 
    Each integer is zero padded.

    :param N: a non-negative integer.
    :returns: a string containing all even integers 
    from 1 to N reversed, formatted as decsribed above.

    :Example:
    >>> evens_reversed(7)
    '6 4 2'
    >>> evens_reversed(10)
    '10 08 06 04 02'
    """
    reversed_list = ''
    size = len(str(N))
    
    even_odd = N % 2
    if even_odd == 1:
        N = N - 1
    
    for i in range(N, 0, -2):
        num = str(i)
        diff = size - len(num)
        for i in range(diff):
            num = '0' + num
        reversed_list = reversed_list + ' ' + num
    
    return reversed_list[1:]

In [23]:
evens_reversed(1000)

'1000 0998 0996 0994 0992 0990 0988 0986 0984 0982 0980 0978 0976 0974 0972 0970 0968 0966 0964 0962 0960 0958 0956 0954 0952 0950 0948 0946 0944 0942 0940 0938 0936 0934 0932 0930 0928 0926 0924 0922 0920 0918 0916 0914 0912 0910 0908 0906 0904 0902 0900 0898 0896 0894 0892 0890 0888 0886 0884 0882 0880 0878 0876 0874 0872 0870 0868 0866 0864 0862 0860 0858 0856 0854 0852 0850 0848 0846 0844 0842 0840 0838 0836 0834 0832 0830 0828 0826 0824 0822 0820 0818 0816 0814 0812 0810 0808 0806 0804 0802 0800 0798 0796 0794 0792 0790 0788 0786 0784 0782 0780 0778 0776 0774 0772 0770 0768 0766 0764 0762 0760 0758 0756 0754 0752 0750 0748 0746 0744 0742 0740 0738 0736 0734 0732 0730 0728 0726 0724 0722 0720 0718 0716 0714 0712 0710 0708 0706 0704 0702 0700 0698 0696 0694 0692 0690 0688 0686 0684 0682 0680 0678 0676 0674 0672 0670 0668 0666 0664 0662 0660 0658 0656 0654 0652 0650 0648 0646 0644 0642 0640 0638 0636 0634 0632 0630 0628 0626 0624 0622 0620 0618 0616 0614 0612 0610 0608 0606 0604 0602
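
The zero padding can also be delegated to the built-in `str.zfill` method mentioned in the question. This is a minimal sketch under the assumption that every number is padded to the width of `N` itself; the name `evens_reversed_zfill` is illustrative.

```python
def evens_reversed_zfill(N):
    """Same output format as evens_reversed, using str.zfill (illustrative)."""
    width = len(str(N))
    # Largest even number that does not exceed N.
    largest_even = N - N % 2
    return ' '.join(str(k).zfill(width) for k in range(largest_even, 0, -2))


print(evens_reversed_zfill(7))
print(evens_reversed_zfill(10))
```

Using `' '.join` avoids the leading space that has to be sliced off when the string is built by repeated concatenation.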

---

[Recall](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) that the built-in function `open` takes in a file path and returns *a file object* (sometimes called a *file handle*). Below are a few properties of file objects:

* `open(path)` opens the file at location `path` for reading.
* `open(path)` is an *iterable*, which contains successive lines of the file.
* Once a file object is opened, it should be closed after use to release the underlying system resources (open file handles are a limited resource). To ensure a file is closed once you are done with it, use a *context manager* as follows:
```
with open(path) as fh:
    for line in fh:
        process_line(line)
```
* To read the entire file into a string, use the read method:
```
with open(path) as fh:
    s = fh.read()
```
However, you should be careful when reading an entire file into memory that the file isn't too big! *You should avoid this whenever possible!*

**Question 5 (Reading Files):**

Create a function `last_chars` that takes a file object and returns a string consisting of the last character of each line.

*Remark:* A newline is the "delimiter" of the lines of a file, and doesn't count as part of the line (as the tests imply). Every other character is part of the line. For more info on this, see [the interpretation](https://en.wikipedia.org/wiki/Newline#Interpretation) of files as a 'newline delimited variables' file.
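
To see how the newline delimiter shows up in practice, a file object can be imitated with `io.StringIO`. In the sketch below, the text is an assumed stand-in for `chars.txt`, and the missing newline after the final line is a deliberate assumption to show that case.

```python
import io

# An in-memory file object; note that the final line has no trailing newline.
fake_file = io.StringIO('On a branch\nfloating downriver\na cricket, singing')

# Iterating over a file object yields each line with its delimiter attached.
lines = list(fake_file)

# Remove the delimiter before looking at the last character.
stripped = [line.rstrip('\n') for line in lines]
last = ''.join(s[-1] for s in stripped if s)

print(lines)
print(last)
```

Indexing with a fixed offset such as `line[-2]` only works when every line ends in a newline, which is why the delimiter is stripped first here.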



In [175]:
def last_chars(fh):
    """
    last_chars takes a file object and returns a 
    string consisting of the last character of the line.

    :param fh: a file object to read from.
    :returns: a string of last characters from fh

    :Example:
    >>> fp = os.path.join('data', 'chars.txt')
    >>> last_chars(open(fp))
    'hrg'
    """
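    txt = ''

    with fh:
        for line in fh:
            # The newline is a delimiter, not part of the line, and the
            # final line of a file may not end with one at all.
            content = line.rstrip('\n')
            # Empty lines have no last character and are skipped.
            if content:
                txt = txt + content[-1]

    return txt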

In [176]:
fp = os.path.join('data', 'chars.txt')
last_chars(open(fp))


On a branch



floating downriver



a cricket, singing





'hrg'

---

## `numpy` exercises

For an introduction to arrays and `numpy` recall the relevant section of [DSC 10](https://www.inferentialthinking.com/chapters/05/1/Arrays.html).

**Question 6 (Basic Arrays):**

Create the following functions using `numpy` methods satisfying the requirements given in each part. Your solutions should **not** contain any loops or list comprehensions.

* A function `arr_1` that takes in a `numpy` array and adds to each element the square-root of the index of each element.

* A function `arr_2` that takes in a `numpy` array of integers and returns a boolean array (i.e. an array of booleans) whose `ith` element is `True` if and only if the `ith` element of the input array is divisible by 16.

* A function `arr_3` that takes in a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share on successive days in USD and returns an array of growth rates. That is, the `ith` number of the output array should contain the rate of growth in stock price from the $i^{th}$ day to the $(i+1)^{th}$ day. The growth rate should be a proportion, rounded to the nearest hundredth.

* Suppose:
    - `A` is a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share for a company on successive days in USD 
    - you start with \\$20, and put aside \\$20 at the end of each day to buy as much stock as possible the following day. 
    - Any money left-over after a given day is saved for possibly buying stock on a future day. 
    - Create a function `arr_4` that takes in `A` and returns the day on which you can buy at least one share from 'left-over' money. If this never happens, return `-1`. The first stock purchase occurs on day 0. *Note: you cannot buy fractions of a share of stock*.
    
*Example:* If the stock price is \\$3 every day, then the answer is 'day 1':
* day 0: buy 6 shares; \\$2 left-over; \\$22 at end of day.
* day 1: buy 7 shares; \\$1 left-over; \\$21 at end of day.
This is more than the 6 shares that \\$20 can buy.
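
The arithmetic in this example can be cross-checked with a plain day-by-day simulation. The sketch below is for testing only, since the graded solutions must not contain loops; the name `first_extra_share_day` and its `daily_budget` parameter are illustrative assumptions.

```python
def first_extra_share_day(prices, daily_budget=20):
    """Simulate the purchases one day at a time (hypothetical checker)."""
    cash = 0
    for day, price in enumerate(prices):
        # Each day's budget is added to whatever was left over before.
        cash += daily_budget
        shares = cash // price
        # An extra share means more than the day's budget alone could buy.
        if shares > daily_budget // price:
            return day
        cash -= shares * price
    return -1


print(first_extra_share_day([3, 3, 3, 3]))
```

Comparing the output of a vectorized `arr_4` against this simulation on a few hand-made price lists is a quick way to catch off-by-one mistakes.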

In [26]:
fp = os.path.join('data', 'stocks.csv')
stocks = np.array([float(x) for x in open(fp)])

In [27]:
def arr_1(A):
    """
    arr_1 takes in a numpy array and
    adds to each element the square-root of
    the index of each element.

    :param A: a 1d numpy array.
    :returns: a 1d numpy array.

    :Example:
    >>> A = np.array([2, 4, 6, 7])
    >>> out = arr_1(A)
    >>> isinstance(out, np.ndarray)
    True
    >>> np.all(out >= A)
    True
    """

    size = len(A)
    B = np.arange(size)
    B = B**0.5
    
    with_sqrt = A + B
    
    return with_sqrt

In [28]:
arr = np.array([2, 4, 6, 7])
len(arr)
b = np.arange(4)
print(b)
b**0.5

[0 1 2 3]


array([0.        , 1.        , 1.41421356, 1.73205081])

In [29]:
A = np.array([2, 4, 6, 7])
out = arr_1(A)
#isinstance(out, np.ndarray)
out
#np.all(out >= A)

array([2.        , 5.        , 7.41421356, 8.73205081])

In [30]:
def arr_2(A):
    """
    arr_2 takes in a numpy array of integers
    and returns a boolean array (i.e. an array of booleans)
    whose ith element is True if and only if the ith element
    of the input array is divisble by 16.

    :param A: a 1d numpy array.
    :returns: a 1d numpy boolean array.

    :Example:
    >>> out = arr_2(np.array([1, 2, 16, 17, 32, 33]))
    >>> isinstance(out, np.ndarray)
    True
    >>> out.dtype == np.dtype('bool')
    True
    """    

    return A % 16 == 0

In [31]:
out = arr_2(np.array([1, 2, 16, 17, 32, 33, 0]))
#isinstance(out, np.ndarray)
out.dtype == np.dtype('bool')

True

In [32]:
def arr_3(A):
    """
    arr_3 takes in a numpy array of stock
    prices per share on successive days in
    USD and returns an array of growth rates.

    :param A: a 1d numpy array.
    :returns: a 1d numpy array.

    :Example:
    >>> fp = os.path.join('data', 'stocks.csv')
    >>> stocks = np.array([float(x) for x in open(fp)])
    >>> out = arr_3(stocks)
    >>> isinstance(out, np.ndarray)
    True
    >>> out.dtype == np.dtype('float')
    True
    >>> out.max() == 0.03
    True
    """
    B = A[0:-1]
    C = A[1:]
    growth = (C - B) / B
    growth = np.round(growth, 2)

    return growth

In [33]:
A = np.array([7, 8, 9, 10, 11])


In [34]:
fp = os.path.join('data', 'stocks.csv')
stocks = np.array([float(x) for x in open(fp)])
out = arr_3(stocks)
isinstance(out, np.ndarray)
out.dtype == np.dtype('float')
out.max() == 0.03

True

In [35]:
def arr_4(A):
    """
    Create a function arr_4 that takes in A and 
    returns the day on which you can buy at least 
    one share from 'left-over' money. If this never 
    happens, return -1. The first stock purchase occurs on day 0
    :param A: a 1d numpy array of stock prices.
    :returns: an integer of the total number of shares.

    :Example:
    >>> import numbers
    >>> stocks = np.array([3, 3, 3, 3])
    >>> out = arr_4(stocks)
    >>> isinstance(out, numbers.Integral)
    True
    >>> out == 1
    True
    """
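    twenty = 20
    shares = twenty // A
    spent = shares * A
    # Money left over each day when only that day's $20 is spent.
    leftover = twenty - spent
    # Savings available on each day, including that day's remainder.
    running_leftover = leftover.cumsum()
    # An extra share is affordable once the savings reach the price,
    # so the comparison is >= rather than >.
    days = np.flatnonzero(running_leftover >= A)
    if days.size == 0:
        return -1

    # The first such day, as a plain Python int.
    return int(days[0])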

In [36]:
A = np.array([3, 3, 3, 3])
twenty = 20
shares = twenty // A
spent = shares * A
leftover = twenty - spent
running_leftover = leftover.cumsum()
when = running_leftover >= A
days = np.where(when == True)
min(min(days))

1

In [37]:
import numbers
stocks = np.array([3, 3, 3, 3])
out = arr_4(stocks)
isinstance(out, numbers.Integral)
out == 1

array([ True, False, False])

---
## Getting Started with Pandas

The following questions will help you get comfortable with Pandas. These questions are similar to questions on tables in DSC 10; review the [textbook](https://www.inferentialthinking.com) as necessary. As always for Pandas questions:
1. Avoid writing loops through the rows of the dataset to do the problem, and
2. Test the output/correctness of your code with the help of the dataset given, but be sure your code will also run on data "like" the dataset given (sampling rows using the `.sample` method is useful for this!).

**Question 7 (Pandas basics):**

Read in the file `movies_by_year.csv` in the `data` directory and understand the dataset by answering the following questions. To do this, create a function `movie_stats` that takes in a dataframe like `movies` and returns a series containing the following statistics:
* The number of years covered by the dataset (`num_years`).
* The total number of movies made over all years in the dataset (`tot_movies`).
* The year with the fewest number of movies made; a tie should return the earliest year (`yr_fewest_movies`).
* The average amount of money grossed over all the years in the dataset (`avg_gross`).
* The year with the highest gross *per movie* (`highest_per_movie`).
* The name of the top movie during the second-lowest (total) grossing year (`second_lowest`).
* The average number of movies made the year *after* a Harry Potter movie was the #1 movie (`avg_after_harry`).

The index labels of the output series are given in parentheses above.

*Note*: Your function should work on a dataset of the same format that contains information from other years. You may assume that none of the answers involving ranking returns a tie.

*Note*: To make sure your function still runs in the event that one of the 7 parts throws an exception (e.g. due to a very incorrect answer), wrap each part in its own `try`/`except` block.
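
The pattern described in the note can be sketched on a toy dataframe. The column `No Such Column`, the helper name `safe_stats`, and the toy data are made up for illustration; only the structure carries over to `movie_stats`.

```python
import pandas as pd


def safe_stats(df):
    """Compute each statistic independently (illustrative helper)."""
    out = {}

    try:
        out['tot_movies'] = df['Number of Movies'].sum()
    except Exception:
        pass

    try:
        # This column does not exist, so a KeyError is raised and skipped.
        out['missing'] = df['No Such Column'].mean()
    except Exception:
        pass

    return pd.Series(out)


toy = pd.DataFrame({'Year': [2001, 2002], 'Number of Movies': [10, 20]})
result = safe_stats(toy)
print(result)
```

Catching `Exception` rather than using a bare `except:` keeps keyboard interrupts and similar signals from being swallowed.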

In [40]:
movie_fp = os.path.join('data', 'movies_by_year.csv')
movies = pd.read_csv(movie_fp)
movies.head()

Unnamed: 0,Year,Total Gross,Number of Movies,#1 Movie
0,2015,11128.5,702,Star Wars: The Force Awakens
1,2014,10360.8,702,American Sniper
2,2013,10923.6,688,Catching Fire
3,2012,10837.4,667,The Avengers
4,2011,10174.3,602,Harry Potter / Deathly Hallows (P2)


In [93]:
def movie_stats(movies):
    """
    movies_stats returns a series as specified in the notebook.

    :param movies: a dataframe of summaries of
    movies per year as found in `movies_by_year.csv`
    :return: a series with index specified in the notebook.

    :Example:
    >>> movie_fp = os.path.join('data', 'movies_by_year.csv')
    >>> movies = pd.read_csv(movie_fp)
    >>> out = movie_stats(movies)
    >>> isinstance(out, pd.Series)
    True
    >>> 'num_years' in out.index
    True
    >>> isinstance(out.loc['second_lowest'], str)
    True
    """
    out_dict = {}

    # Each statistic is computed in its own try block so that a failure
    # in one part does not prevent the others from being returned.
    try:
        # Count distinct years; max - min would be off by one.
        out_dict['num_years'] = movies['Year'].nunique()
    except Exception:
        pass

    try:
        # Total over all years, not the column itself.
        out_dict['tot_movies'] = movies['Number of Movies'].sum()
    except Exception:
        pass

    try:
        fewest = movies[movies['Number of Movies'] == movies['Number of Movies'].min()]
        # Ties are broken by taking the earliest year.
        out_dict['yr_fewest_movies'] = fewest['Year'].min()
    except Exception:
        pass

    try:
        out_dict['avg_gross'] = movies['Total Gross'].mean()
    except Exception:
        pass

    try:
        # Gross per movie, not total gross, decides this statistic.
        per_movie = movies['Total Gross'] / movies['Number of Movies']
        out_dict['highest_per_movie'] = movies.loc[per_movie.idxmax(), 'Year']
    except Exception:
        pass

    try:
        by_gross = movies.sort_values('Total Gross')
        out_dict['second_lowest'] = by_gross['#1 Movie'].iloc[1]
    except Exception:
        pass

    try:
        harry = movies['#1 Movie'].str.contains('Harry Potter')
        years_after_harry = movies.loc[harry, 'Year'] + 1
        after_harry = movies[movies['Year'].isin(years_after_harry)]
        out_dict['avg_after_harry'] = after_harry['Number of Movies'].mean()
    except Exception:
        pass

    return pd.Series(out_dict)

In [97]:
movie_fp = os.path.join('data', 'movies_by_year.csv')
movies = pd.read_csv(movie_fp)
out = movie_stats(movies)
#isinstance(out, pd.Series)
#'num_years' in out.index
#isinstance(out.loc['second_lowest'], str)

True

In [88]:
r_dict = {}
min_year = movies['Year'].min()
max_year = movies['Year'].max()
num_years = max_year - min_year
num_years
r_dict.update(num_years = num_years)
r_dict

{'num_years': 33}

In [44]:
tot_movies = movies['Number of Movies'].sum()
tot_movies

17834

In [47]:
least_mov = movies[movies['Number of Movies'] == movies['Number of Movies'].min()]
yr_fewest_movies = least_mov['Year'].min()
yr_fewest_movies

1990

In [49]:
avg_gross = movies['Total Gross'].mean()
avg_gross

7226.914705882354

In [51]:
highest_gross = movies[movies['Total Gross'] == movies['Total Gross'].max()]
highest_per_movie = highest_gross['Year'].min()
highest_per_movie

2015

In [63]:
by_gross = movies.sort_values('Total Gross').reset_index()
#by_gross.head()
second_lowest = by_gross['#1 Movie'][1]
second_lowest

'Back to the Future'

In [78]:
harry = movies[movies['#1 Movie'].str.contains('Harry Potter')]
harry_years = harry['Year']
harry_years
year_after_harry = harry_years + 1
year_after_harry = year_after_harry.tolist()
movies_after_harry = movies[movies['Year'].isin(year_after_harry)]
movies_after_harry
avg_after_harry = movies_after_harry['Number of Movies'].mean()
avg_after_harry

573.0

---

## CSV Files

**Question 8 (Reading malformed csv files):**

`malformed.csv` is a file of comma-separated values with the following fields:


|column name|description|type|
|---|---|---|
|first|first name of person|str|
|last|last name of person|str|
|weight|weight of person (lbs)|float|
|height|height of person (in)|float|
|geo|location of person; comma-separated latitude/longitude|str|

Unfortunately, the entries contain errors that cause the Pandas `read_csv` function to fail to parse the file with the default settings. Instead, you must read in the file manually using Python's built-in `open` function.

Clean the csv file into a Pandas DataFrame with columns as described in the table above, by creating a function called `parse_malformed` that takes in a file path and returns a parsed, properly-typed dataframe. The dataframe should contain columns as described in the table above (with the specified types); it should agree with `pd.read_csv` when the lines are not malformed.


*Note:* Assume that the given csv file is a sample of a larger file; you will be graded against a **different** sample of the larger file that has the same type of parsing errors. That is, you should **not** hard-code your cleaning of the data to specific errors on specific lines in the data.
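
A general approach is to normalize each raw line with the same rules rather than patching individual lines. The sketch below assumes two kinds of damage, stray extra commas and quotes around the geo field; the sample line is made up for illustration and is not taken from `malformed.csv`, and `clean_line` is a hypothetical helper.

```python
def clean_line(line):
    """Split one raw line into [first, last, weight, height, geo] (illustrative)."""
    # Drop quotes, split on every comma, and discard empty fields
    # produced by stray extra commas.
    fields = [f for f in line.replace('"', '').strip().split(',') if f.strip()]
    # The last two fields are latitude and longitude; rejoin them.
    geo = fields[-2] + ',' + fields[-1]
    return fields[:-2] + [geo]


# A made-up line with a stray comma and a quoted geo field.
sample = 'Ada,Lovelace,127.0,,77.0,"39.1,93.6"\n'
cleaned = clean_line(sample)
print(cleaned)
```

Because the rules do not mention specific line numbers or values, the same helper applies unchanged to a different sample of the larger file, provided its errors are of the assumed kinds.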

In [170]:
def parse_malformed(fp):
    """
    Parses and loads the malformed csv data into a 
    properly formatted dataframe (as described in 
    the question).

    :param fh: file handle for the malformed csv-file.
    :returns: a Pandas DataFrame of the data, 
    as specificed in the question statement.

    :Example:
    >>> fp = os.path.join('data', 'malformed.csv')
    >>> df = parse_malformed(fp)
    >>> cols = ['first', 'last', 'weight', 'height', 'geo']
    >>> list(df.columns) == cols
    True
    >>> df['last'].dtype == np.dtype('O')
    True
    >>> df['height'].dtype == np.dtype('float64')
    True
    >>> df['geo'].str.contains(',').all()
    True
    >>> len(df) == 100
    True
    >>> dg = pd.read_csv(fp, nrows=4, skiprows=10, names=cols)
    >>> dg.index = range(9, 13)
    >>> (dg == df.iloc[9:13]).all().all()
    True
    """
    rows = []
    with open(fp) as fl:
        # The first line holds the column names.
        cols = fl.readline().strip().split(',')
        for line in fl:
            # Drop the quotes around geo, then split on every comma.
            fields = line.replace('"', '').strip().split(',')

            # Stray extra commas show up as empty fields; discard them
            # without touching the contents of the other fields.
            fields = [field for field in fields if field.strip()]

            # The last two fields are latitude and longitude; rejoin them
            # so that geo is a single comma-separated string.
            geo = fields[-2] + ',' + fields[-1]
            rows.append(fields[:-2] + [geo])

    # The with block has already closed the file at this point.
    # Create the dataframe and set the column types.
    df = pd.DataFrame(rows, columns=cols)
    df['first'] = df['first'].astype(str)
    df['last'] = df['last'].astype(str)
    df['weight'] = df['weight'].astype(np.float64)
    df['height'] = df['height'].astype(np.float64)
    df['geo'] = df['geo'].astype(str)

    return df


In [171]:
fp = os.path.join('data', 'malformed.csv')
df = parse_malformed(fp)
cols = ['first', 'last', 'weight', 'height', 'geo']
list(df.columns) == cols
df['last'].dtype == np.dtype('O')
df['height'].dtype == np.dtype('float64')
df['geo'].str.contains(',').all()
len(df) == 100
dg = pd.read_csv(fp, nrows=4, skiprows=10, names=cols)
dg.index = range(9, 13)
(dg == df.iloc[9:13]).all().all()

True

In [151]:
fp = os.path.join('data', 'malformed.csv')
rows = []
with open(fp) as fl:
    cols = fl.readline().strip()
    cols = cols.split(',')
    print(cols)
    for line in fl:
        word = line.replace('"', '') # take out " in the geo locations
        word = word.strip()
        entry = word.split(',')
        # print(entry)
        
        # there are some null (empty) values in the data so take them out
        empty = ' '
        if '' in entry:
            entry = ' '.join(entry).split()
            
        
        # put the geo location together and have it at the end of the list
        geo = entry[-2] + ', ' + entry[-1]
        entry = entry[:-2]
        entry.append(geo)
        #print(entry)
    
        
        # add all entries to the row list
        rows.append(entry)
        
    # close file to make sure there is no memory leak
    fl.close()
    
    #print(rows)
    
    # create the data frame
    df = pd.DataFrame(rows, columns = cols)
    df['first'] = df['first'].astype(str)
    df['last'] = df['last'].astype(str)
    df['weight'] = df['weight'].astype(np.float64)
    df['height'] = df['height'].astype(np.float64)
    df['geo'] = df['geo'].astype(str)
    
    print(df.dtypes)
    
    

        

['first', 'last', 'weight', 'height', 'geo']
first      object
last       object
weight    float64
height    float64
geo        object
dtype: object


In [145]:
a1 = ['Anthony', 'Janaysia', '127.0', '', '77.0', '39.1, 93.6']
print(a1)
if '' in a1:
    print(a1)
    a1 = ' '.join(a1)
    print(a1)
    a1 = a1.split()


['Anthony', 'Janaysia', '127.0', '', '77.0', '39.1, 93.6']
['Anthony', 'Janaysia', '127.0', '', '77.0', '39.1, 93.6']
Anthony Janaysia 127.0  77.0 39.1, 93.6


## Congratulations! You're done!

* Submit the lab on Gradescope