# Day 18 Kata: Data Munging

A [code kata](http://codekata.com) is a short exercise for practicing particular Python skills.

This exercise is adapted from [Kata04](http://codekata.com/kata/kata04-data-munging/), by Dave Thomas.

**Due: Monday, November 7 at 12 noon**

### Part One: Weather Data

In `./data/weather.dat` you’ll find daily weather data for Morristown, NJ for June 2002.
Write a function that outputs the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).

In [4]:
def find_day_with_smallest_spread():
    # Read every line and split it into whitespace-separated fields
    with open('./data/weather.dat') as f:
        lines = [line.split() for line in f]

    # Remove unnecessary rows: the header, the blank line after it,
    # and the monthly summary at the end
    del lines[0]
    del lines[0]
    del lines[-1]

    temperature = []
    for values in lines:
        day = values[0]
        # Some readings carry a '*' marker, so strip it before converting
        max_temp = float(values[1].strip('*'))
        min_temp = float(values[2].strip('*'))
        temperature.append([day, max_temp - min_temp])

    # The day with the smallest spread
    return min(temperature, key=lambda x: x[1])


print(find_day_with_smallest_spread())

['14', 2.0]
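
Deleting rows by position works for this file but depends on its exact layout. A minimal sketch of an alternative, assuming the data rows are exactly those whose first field is a day number; `smallest_spread` is a hypothetical name, and the sample lines are made-up values in the same layout, not the contents of `weather.dat`:

```python
def smallest_spread(lines):
    """Return [day, spread] for the row with the smallest max-min spread."""
    rows = []
    for line in lines:
        fields = line.split()
        # Keep only data rows: at least three fields, the first a day number
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        max_temp = float(fields[1].strip('*'))
        min_temp = float(fields[2].strip('*'))
        rows.append([fields[0], max_temp - min_temp])
    return min(rows, key=lambda row: row[1])


# Made-up sample: header, blank line, three data rows, summary row
sample = [
    "  Dy MxT   MnT",
    "",
    "   1  88    59",
    "   2  79    63*",
    "   3  77*   55",
    "  mo  81.3  59.0",
]
print(smallest_spread(sample))  # ['2', 16.0]
```

Because an open file is iterable line by line, the same function could be passed a file object instead of a list.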


### Part Two: Soccer League Table
The file `./data/football.dat` contains the results from the English Premier League for 2001/2.
The columns labeled ‘F’ and ‘A’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them).
Write a function to print the name of the team with the smallest difference between ‘for’ and ‘against’ goals.

In [7]:
def find_team_with_smallest_difference():
    # Read every line and split it into whitespace-separated fields
    with open('./data/football.dat') as f:
        lines = [line.split() for line in f]

    # Remove unnecessary rows: the header, then the dashed separator line
    del lines[0]
    del lines[17]

    goal_df = []
    for values in lines:
        team = values[1]
        # After splitting, F and A are fields 6 and 8 (field 7 is the '-' between them)
        for_goal = float(values[6])
        aga_goal = float(values[8])
        goal_df.append([team, abs(for_goal - aga_goal)])

    # The team with the smallest absolute goal difference
    return min(goal_df, key=lambda x: x[1])


print(find_team_with_smallest_difference())

['Aston_Villa', 1.0]


### Part Three: DRY Fusion
“In software engineering, **d**on't **r**epeat **y**ourself (DRY) is a principle of software development, aimed at reducing repetition of information of all kinds” – [Wikipedia](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself).

(Compare this to [copy and paste programming](https://en.wikipedia.org/wiki/Copy_and_paste_programming).)

Take the two functions written previously and [factor out](https://en.wikipedia.org/wiki/Code_refactoring) as much common code as possible, leaving you with two smaller functions that each call a third function.

In [9]:
import string

def find_smallest_difference(source, list_of_indices_to_remove, column_index_of_observation, column_index_of_first_number, column_index_of_second_number):
    # Read every line and split it into whitespace-separated fields
    with open(source) as f:
        lines = [line.split() for line in f]

    # Resolve negative indices (such as -1) against the original length,
    # then delete from the end so that earlier deletions cannot shift
    # the rows still waiting to be removed
    indices = set(index % len(lines) for index in list_of_indices_to_remove)
    for index in sorted(indices, reverse=True):
        del lines[index]

    obs_list = []
    for values in lines:
        obs = values[column_index_of_observation]
        # Strip markers such as '*' from the numeric fields before converting
        first_number = float(values[column_index_of_first_number].strip(string.punctuation))
        second_number = float(values[column_index_of_second_number].strip(string.punctuation))
        obs_list.append([obs, abs(first_number - second_number)])

    # The observation with the smallest absolute difference
    return min(obs_list, key=lambda x: x[1])


print(find_smallest_difference('./data/football.dat', [0, 18], 1, 6, 8))
print(find_smallest_difference('./data/weather.dat', [0, 1, -1], 0, 1, 2))

['Aston_Villa', 1.0]
['14', 2.0]
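
The exercise asks for two smaller functions that each call a third. One possible shape for that is sketched below. The names `smallest_difference`, `smallest_spread_day`, and `closest_team` are hypothetical, and the helper takes already-read lines plus a row predicate (an assumption made so the sketch runs without the data files). The sample rows are made up to match the layouts described above; they are not taken from the real files.

```python
import string


def smallest_difference(lines, is_data_row, name_col, first_col, second_col):
    """Return [name, abs difference] for the data row with the smallest gap."""
    results = []
    for line in lines:
        fields = line.split()
        # Skip headers, separators, and summary rows
        if not is_data_row(fields):
            continue
        first = float(fields[first_col].strip(string.punctuation))
        second = float(fields[second_col].strip(string.punctuation))
        results.append([fields[name_col], abs(first - second)])
    return min(results, key=lambda item: item[1])


def smallest_spread_day(lines):
    # Weather rows start with a day number; max and min are fields 1 and 2
    return smallest_difference(
        lines, lambda f: len(f) > 2 and f[0].isdigit(), 0, 1, 2)


def closest_team(lines):
    # League rows start with a rank such as '1.'; F and A are fields 6 and 8
    return smallest_difference(
        lines, lambda f: len(f) > 8 and f[0].rstrip('.').isdigit(), 1, 6, 8)


# Made-up rows, not the real data files
weather = ["  Dy MxT   MnT", "", "   1  88    59", "   2  79    63*",
           "  mo  81.3  59.0"]
league = [
    "       Team            P     W    L   D    F      A     Pts",
    "    1. Alpha           38    26   9   3    79  -  36    87",
    "   -------------------------------------------------------",
    "    2. Beta            38    12  14  12    46  -  47    50",
]
print(smallest_spread_day(weather))  # ['2', 16.0]
print(closest_team(league))          # ['Beta', 1.0]
```

With the real data, each wrapper could be given an open file, for example `with open('./data/weather.dat') as f: print(smallest_spread_day(f))`.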


## Quick poll

About how long did you spend working on this Reading Journal?

4 hours

## Reading Journal feedback

Have any comments on this Reading Journal? Feel free to leave them below and we'll read them when you submit your journal entry. This could include suggestions to improve the exercises, topics you'd like to see covered in class next time, or other feedback.
If you have Python questions or run into problems while completing the reading, you should post them to Piazza instead so you can get a quick response before your journal is submitted.