# Pandas | List Comprehensions and Lambda Functions
--- 
## Concepts:
- How to read and work with __`JSON`__ data.
- How to use __list comprehensions__ to extract specific values from `JSON` objects.
- Some of the theory behind passing functions as arguments.
- How to create single-use __`lambda`__ functions.
- How to use __`lambda`__ functions in __`pandas`__ to extract tags from Hacker News stories.

In [1]:
import numpy as np
import pandas as pd
import re 
# import seaborn as sns
# import matplotlib.pyplot as plt
# %matplotlib inline

### Instructions

We have created a JSON string, __`world_cup_str`__, which contains data about games from the 2018 Football World Cup.

1. Import the __`json`__ module.
2. Use __`json.loads()`__ to convert __`world_cup_str`__ to a Python object. Assign the result to __`world_cup_obj`__.

In [2]:
world_cup_str = """
[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
"""

import json
world_cup_obj = json.loads(world_cup_str)
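
As a rough illustration of what `json.loads()` hands back, the sketch below parses a shortened, made-up string in the same shape as `world_cup_str` (the name `sample_str` is an assumption for this example only). JSON arrays come back as Python lists and JSON objects as dictionaries.

```python
import json

# Shortened stand-in for world_cup_str: a JSON array holding one object.
sample_str = '[{"team_1": "France", "team_2": "Croatia", "score": [4, 2]}]'

sample_obj = json.loads(sample_str)

# The outer JSON array becomes a list, each JSON object becomes a dict.
outer_type = type(sample_obj)
inner_type = type(sample_obj[0])

# Nested values are reached with ordinary list and dict indexing.
final_score = sample_obj[0]["score"]
print(outer_type, inner_type, final_score)
```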

In [3]:
import json
# the with-block closes the file once parsing is done
with open("hn_2014.json") as file:
    hn = json.load(file)

print(type(hn))

<class 'list'>


Our `hn` variable is a list. Let's find out how many objects it contains and the type of the first object (in JSON data, every object in a list almost always has the same type as the first):

In [4]:
print(len(hn))
print(type(hn[0]))

35806
<class 'dict'>


### Instructions 
The `json` module remains imported from the previous screen.

1. Use the `open()` function to open the `hn_2014.json` file as a file object.
2. Use the `json.load()` function to parse the file object and assign the result to `hn`.

In [5]:
# the with-block closes the file once parsing is done
with open("hn_2014.json") as file:
    hn = json.load(file)
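
The difference between `json.load()` and `json.loads()` can be sketched without the data file. In this minimal example, `io.StringIO` stands in for the object returned by `open()`, and the JSON text is made up for illustration.

```python
import io
import json

# Made-up JSON text; the real notebook reads hn_2014.json instead.
json_text = '[{"title": "Example story", "points": 3}]'

# json.loads() parses a string directly.
from_string = json.loads(json_text)

# json.load() parses from a file-like object. StringIO plays the role of
# the file object that open() would return.
with io.StringIO(json_text) as fake_file:
    from_file = json.load(fake_file)

print(from_string == from_file)
```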

In [6]:
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

first_story = hn[0]
jprint(first_story)

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "createdAtI": 1401350870,
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}


### Instructions

We have provided the code for the `del_key()` function.

1. Create an empty list, `hn_clean`, to store the cleaned data set.
2. Loop over the dictionaries in the `hn` list. In each iteration:
    - Use the `del_key()` function to delete the `createdAtI` key from the dictionary.
    - Append the cleaned dictionary to `hn_clean`.

Reference for the `del` statement: https://docs.python.org/3.7/reference/simple_stmts.html#del

In [7]:
def del_key(dict_, key):
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

hn_clean = []

for i in hn:
    hn_clean.append(del_key(i, 'createdAtI'))

In [8]:
hn_clean[:2]

[{'author': 'dragongraphics',
  'numComments': 0,
  'points': 2,
  'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
  'storyText': '',
  'createdAt': '2014-05-29T08:07:50Z',
  'tags': ['story', 'author_dragongraphics', 'story_7815238'],
  'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
  'objectId': '7815238'},
 {'author': 'jcr',
  'numComments': 0,
  'points': 1,
  'url': 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
  'storyText': '',
  'createdAt': '2014-05-29T08:05:58Z',
  'tags': ['story', 'author_jcr', 'story_7815234'],
  'title': 'Telemba Turns Your Old Roomba and Tablet Into a Telepresence Robot',
  'objectId': '7815234'}]
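
Because `del_key()` works on a copy, the original dictionary keeps its key. The sketch below checks this on a single made-up story (the `story` dictionary is an assumption, not a record from the data set). It also shows that `dict.copy()` is a shallow copy, so nested values such as the `tags` list are shared between the original and the copy.

```python
def del_key(dict_, key):
    # Same helper as in the notebook: copy first, then delete from the copy.
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

# Made-up record in the same shape as one Hacker News story.
story = {"title": "Example story", "createdAtI": 1401350870, "tags": ["story"]}
cleaned = del_key(story, "createdAtI")

# The original is untouched; only the copy lost the key.
original_has_key = "createdAtI" in story
cleaned_has_key = "createdAtI" in cleaned

# dict.copy() is shallow: both dictionaries point at the same tags list.
shares_tags = cleaned["tags"] is story["tags"]
print(original_has_key, cleaned_has_key, shares_tags)
```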

### Instructions

We've provided the solution for the previous screen in comments for you to use as a reference.

1. Create a list comprehension representation of the loop from the previous screen:
    - Call the __`del_key()`__ function to remove the `createdAtI` key from each dictionary in the `hn` list.
2. Assign the results to a new list, `hn_clean`.

In [9]:
hn_clean_2 = [del_key(dictionary, 'createdAtI') for dictionary in hn]

In [10]:
hn_clean_2[:1]

[{'author': 'dragongraphics',
  'numComments': 0,
  'points': 2,
  'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
  'storyText': '',
  'createdAt': '2014-05-29T08:07:50Z',
  'tags': ['story', 'author_dragongraphics', 'story_7815238'],
  'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
  'objectId': '7815238'}]
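
To check that the loop and the list comprehension really do the same work, here is a small sketch on two made-up records (the `sample` list is an assumption for illustration). Both versions call `del_key()` once per dictionary and collect the results in order.

```python
def del_key(dict_, key):
    # Same helper as in the notebook.
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

# Made-up records standing in for the hn list.
sample = [
    {"title": "A", "createdAtI": 1},
    {"title": "B", "createdAtI": 2},
]

# Loop version: create a list, then append inside the loop.
cleaned_loop = []
for story in sample:
    cleaned_loop.append(del_key(story, "createdAtI"))

# Comprehension version: the appended expression moves to the front.
cleaned_comp = [del_key(story, "createdAtI") for story in sample]

same_result = cleaned_loop == cleaned_comp
print(same_result)
```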

### Instructions

1. Use a list comprehension to extract the `url` value from each dictionary in `hn_clean`. Assign the result to `urls`.

In [11]:
# the comprehension builds the list itself, so no empty list is needed first
urls = [story['url'] for story in hn_clean]

### Instructions

1. Use a list comprehension to create a new list, `thousand_points`:
    - The list should contain values from `hn_clean` where the `points` key has a value greater than `1000`.
2. Count the number of values in `thousand_points` and assign the result to `num_thousand_points`.

In [12]:
thousand_points = [i for i in hn_clean if i['points'] > 1000]
num_thousand_points = len(thousand_points)
thousand_points[:2]

[{'author': 'keithwarren',
  'numComments': 451,
  'points': 1297,
  'url': 'http://roslyn.codeplex.com/',
  'storyText': '',
  'createdAt': '2014-04-03T16:48:14Z',
  'tags': ['story', 'author_keithwarren', 'story_7524082'],
  'title': 'Microsoft Open Sources C# Compiler',
  'objectId': '7524082'},
 {'author': 'zipop',
  'numComments': 403,
  'points': 1192,
  'url': 'http://www.teslamotors.com/blog/people-new-jersey',
  'storyText': '',
  'createdAt': '2014-03-14T19:05:37Z',
  'tags': ['story', 'author_zipop', 'story_7401029'],
  'title': 'Elon Musk: To the People of New Jersey',
  'objectId': '7401029'}]
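
The filtering pattern can be tried on a handful of made-up stories (the `stories` list below is an assumption, not real data). The `if` clause at the end of the comprehension keeps only the items for which the condition is true, and `>` excludes a story with exactly 1000 points.

```python
# Made-up stories; only the points values matter here.
stories = [
    {"title": "A", "points": 1500},
    {"title": "B", "points": 12},
    {"title": "C", "points": 1001},
    {"title": "D", "points": 1000},
]

# Keep whole dictionaries whose points value is strictly greater than 1000.
popular = [story for story in stories if story["points"] > 1000]
num_popular = len(popular)

# A second comprehension can then pull out a single field.
popular_titles = [story["title"] for story in popular]
print(num_popular, popular_titles)
```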

### Instructions

1. Create a "key function" that accepts a single dictionary and returns the value from the `numComments` key.
2. Use the `max()` function with the "key function" you just created to find the value from the `hn_clean` list with the most comments:
    - Assign the result to the variable `most_comments`.

In [13]:
def get_nc(story):
    # key function: return the comment count of one story dictionary
    return story['numComments']

In [14]:
most_comments = max(hn_clean, key=get_nc)
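
A point that is easy to miss: `max()` with a `key` function returns the whole item, not the value the key function produced. The sketch below shows this on made-up data (the `stories` list and the name `get_num_comments` are assumptions for illustration). The same key function also works with `min()`.

```python
# Made-up stories for illustration.
stories = [
    {"title": "A", "numComments": 5},
    {"title": "B", "numComments": 42},
    {"title": "C", "numComments": 7},
]

def get_num_comments(story):
    # Key function: tells max() and min() which value to compare.
    return story["numComments"]

# The function itself is passed, without parentheses.
most = max(stories, key=get_num_comments)
least = min(stories, key=get_num_comments)

# The result is the full dictionary, so any field can be read from it.
print(most["title"], most["numComments"])
print(least["title"], least["numComments"])
```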

### Instructions

1. In the display code, we have defined (in comments) a function `multiply()` using traditional syntax.
2. Create a lambda function that performs the same operation. Assign it to the variable name `multiply`.

In [15]:
# def multiply(a, b):
#    return a * b

multiply = lambda a, b: a * b
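
As a quick check that the two forms behave the same, the sketch below defines both under different names (`multiply_def` and `multiply_lambda` are names chosen for this example). It also shows the more typical use of a lambda: written inline as an argument rather than assigned to a variable.

```python
def multiply_def(a, b):
    return a * b

# Equivalent lambda: parameters before the colon, return value after it.
multiply_lambda = lambda a, b: a * b

same_answer = multiply_def(3, 4) == multiply_lambda(3, 4)

# Typical single-use lambda, passed directly as a key function.
longest = max(["pandas", "json", "re"], key=lambda word: len(word))
print(same_answer, longest)
```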

### Instructions

1. Using `sorted()` and a `lambda` function, sort the `hn_clean` JSON list by the number of points (dictionary key `points`) from highest to lowest:
    - Check the documentation for `sorted()` to see how to sort in descending order: https://docs.python.org/3.7/library/functions.html#sorted
    - Assign the result to `hn_sorted_points`.
2. Use a list comprehension to return a list of the five post titles (dictionary key `title`) that have the most points in our data set:
    - Assign the result to `top_5_titles`.

In [16]:
hn_sorted_points = sorted(hn_clean, key=lambda d: d['points'], reverse=True)

In [17]:
top_5_titles = [i['title'] for i in hn_sorted_points[:5]]

In [18]:
top_5_titles

['2048',
 'Today is The Day We Fight Back',
 'Wozniak: “Actually, the movie was largely a lie about me”',
 'Microsoft Open Sources C# Compiler',
 'Elon Musk: To the People of New Jersey']
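
The sort-then-slice pattern can be sketched on three made-up stories (the `stories` list is an assumption for illustration). Note that `sorted()` returns a new list and leaves the input in its original order.

```python
# Made-up stories for illustration.
stories = [
    {"title": "A", "points": 10},
    {"title": "B", "points": 300},
    {"title": "C", "points": 45},
]

# reverse=True sorts from the highest key value to the lowest.
by_points = sorted(stories, key=lambda story: story["points"], reverse=True)

# Slice the sorted list first, then extract the titles.
top_2_titles = [story["title"] for story in by_points[:2]]

# The input list keeps its original order.
original_order = [story["title"] for story in stories]
print(top_2_titles, original_order)
```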

### Instructions

1. Import the pandas library.
2. Use the __`pandas.DataFrame()`__ constructor to create a dataframe version of the `hn_clean` JSON list. Assign the result to `hn_df`.

In [19]:
hn_df = pd.DataFrame(hn_clean)
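
To see how a list of dictionaries maps onto a dataframe, here is a minimal sketch with two made-up records (the `stories` list is an assumption). Each dictionary becomes a row and each key becomes a column; a value that is itself a list, such as `tags`, is stored whole inside a single cell.

```python
import pandas as pd

# Made-up records in the same shape as the cleaned stories.
stories = [
    {"title": "A", "points": 10, "tags": ["story", "author_x"]},
    {"title": "B", "points": 300, "tags": ["story", "author_y"]},
]

df = pd.DataFrame(stories)

# One row per dictionary, one column per key.
n_rows, n_cols = df.shape
column_names = list(df.columns)

# The tags cell in the first row still holds the original Python list.
first_tags = df.loc[0, "tags"]
print(n_rows, n_cols, column_names, first_tags)
```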

### Instructions

1. Use __`Series.apply()`__ and __`len()`__ to create a boolean mask based on whether each item in `tags` has a length of 4.
2. Use the boolean mask to filter `tags`. Assign the result to __`four_tags`__.

In [20]:
tags = hn_df['tags']

four_tags = tags[tags.apply(len) == 4]
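
The one-liner above does three things at once. The sketch below separates them on a small made-up series (the tag values are assumptions for illustration): `apply(len)` gives a series of lengths, the comparison turns it into a boolean mask, and indexing with the mask keeps the matching rows.

```python
import pandas as pd

# Made-up tag lists: one with three items, two with four.
tags = pd.Series([
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "show_hn"],
    ["story", "author_c", "story_3", "ask_hn"],
])

# Step 1: apply len() to every item, giving a series of integers.
lengths = tags.apply(len)

# Step 2: compare against 4 to get a boolean mask.
mask = lengths == 4

# Step 3: index with the mask to keep rows where it is True.
four_tags = tags[mask]
print(lengths.tolist(), mask.tolist(), four_tags.index.tolist())
```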

### Instructions

We have provided a function that uses a ternary operator to provide the logic to extract the tags.

1. Use __`Series.apply()`__ and a __`lambda`__ function to extract the tag data from `tags`:
    - Where the item is a list with length four, return the last item.
    - In all other cases, return `None`.
2. Assign the result to `cleaned_tags`.
3. Assign the `cleaned_tags` series to the `tags` column of the `hn_df` dataframe.

In [21]:
# def extract_tag(l):
#     return l[-1] if len(l) == 4 else None

cleaned_tags = hn_df['tags'].apply(lambda tag: tag[-1] if len(tag) == 4 else None)

hn_df['tags'] = cleaned_tags
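
A minimal sketch of the ternary lambda on made-up data (the tag values are assumptions for illustration): a list of length four yields its last item, and anything else yields a missing value. How pandas displays that missing value can vary between versions, so the example checks it with `isna()` rather than relying on its printed form.

```python
import pandas as pd

# Made-up tag lists: the first has three items, the second has four.
tags = pd.Series([
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "show_hn"],
])

# Ternary form: <value if true> if <condition> else <value if false>.
cleaned = tags.apply(lambda tag: tag[-1] if len(tag) == 4 else None)

# The three-item row is now missing; the four-item row kept its last tag.
missing_flags = cleaned.isna().tolist()
kept_tag = cleaned[1]
print(missing_flags, kept_tag)
```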

### _Experimenting with code:_
---

In [22]:
exp_list = ['abc', 'abc', 'abc', 'abc', 'abc', 'abc']

for i in range(0, 6):
    exp_list[i] = '{}{}'.format(exp_list[i], i)
exp_list

['abc0', 'abc1', 'abc2', 'abc3', 'abc4', 'abc5']

In [23]:
exp_list_upd = ['{}{}'.format(exp_list[i][:3] , (i+1) * 2.75) for i in range(0, 6)]
exp_list_upd

['abc2.75', 'abc5.5', 'abc8.25', 'abc11.0', 'abc13.75', 'abc16.5']
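
One possible variation on the experiment above uses `enumerate()` to get the index and the item together, which avoids indexing with `range()`. This is only a sketch of an alternative; the names `numbered` and `scaled` are chosen for this example.

```python
exp_list = ['abc'] * 6

# enumerate() yields (index, item) pairs, so no manual indexing is needed.
numbered = ['{}{}'.format(text, i) for i, text in enumerate(exp_list)]

# Same idea as the second cell: keep the first three characters of each
# item and append a multiple of 2.75.
scaled = ['{}{}'.format(text[:3], (i + 1) * 2.75) for i, text in enumerate(numbered)]
print(numbered)
print(scaled)
```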