## Bringing It All Together: Data Analysis and Visualization in Python

## Introduction

Okay, we've covered a lot in this section. We have learned how to work with lists, dictionaries, and functions, and even how to visualize data with libraries like folium and plotly. In this lab, we will bring all these skills together and flex our newfound data science skills. As burgeoning data scientists in the tech community, we are going to need to find ways to grow our network and learn more skills. Meetup is a great site for finding groups catered to particular interests and industries. This is a great way for us to expand our network and find events to learn more about how Python and data science are being used in the marketplace.

Our goal is to visualize data from the Meetup API to show which cities have the most active Python and Data Science communities. Let's get started!

## Objectives

* Understand how to programmatically format outside data
* Choose proper datatypes and formats for comparing data and creating visualizations
* Use higher-order functions like `map` and `filter`
* Write re-usable functions to make [DRY code](https://stackoverflow.com/questions/6453235/what-does-damp-not-dry-mean-when-talking-about-unit-tests)

## Formatting Our Data

Okay, so first things first, we have to figure out where our data is coming from. Typically, we would query an API, which gives programs access to a website's data, much like querying a database. APIs can contain information on any number of things. In the case of Meetup, the API has information on every group and event, all member profiles, the interests of those members, etc. Once we successfully query the information we want, we can format it and use it for our own purposes. To make it easier, we've already queried the Meetup API for our data and saved it in this repository. We can now simply import the data and begin to work with it.

We have gathered information for Python-focused Meetup groups within a 15-mile radius of New York, San Francisco, Denver, Austin, and Atlanta.

Let's take a look at what one of our groups looks like:

In [1]:
from groups import data

In [2]:
data[0]

{'utc_offset': -14400000,
 'country': 'US',
 'visibility': 'public',
 'city': 'New York',
 'timezone': 'US/Eastern',
 'created': 1150197645000,
 'topics': [{'urlkey': 'opensource', 'name': 'Open Source', 'id': 563},
  {'urlkey': 'python', 'name': 'Python', 'id': 1064},
  {'urlkey': 'softwaredev', 'name': 'Software Development', 'id': 3833},
  {'urlkey': 'django', 'name': 'Django', 'id': 10553},
  {'urlkey': 'web-development', 'name': 'Web Development', 'id': 15582},
  {'urlkey': 'computer-programming',
   'name': 'Computer programming',
   'id': 48471},
  {'urlkey': 'python-web-development',
   'name': 'Python Web Development',
   'id': 917242}],
 'link': 'https://www.meetup.com/nycpython/',
 'rating': 4.64,
 'description': '<p>Meet other local Python Programming Language enthusiasts!</p>',
 'lon': -73.98999786376953,
 'group_photo': {'highres_link': 'https://secure.meetupstatic.com/photos/event/c/a/1/4/highres_313251732.jpeg',
  'photo_id': 313251732,
  'base_url': 'https://secure.mee

Wow, that is a lot of information! We can see that the first element of the `data` list is a dictionary representing one of the groups we've pulled from the Meetup API. Each group takes this same form; that is, each has the same keys (e.g. `description`, `state`, `city`, `lat`, `lon`, `category`, `topics`, etc.). We can more easily see a list of keys by using the `keys()` method we've used in earlier labs.

In [3]:
data[0].keys()

dict_keys(['utc_offset', 'country', 'visibility', 'city', 'timezone', 'created', 'topics', 'link', 'rating', 'description', 'lon', 'group_photo', 'join_mode', 'organizer', 'members', 'name', 'id', 'state', 'urlname', 'category', 'lat', 'who'])
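The claim that every group shares the same keys is easy to check rather than assume. The sketch below uses a small hypothetical `sample_groups` list (a stand-in for `data`, with made-up values) and a hypothetical helper, `has_consistent_keys`, that compares each dictionary's key set against the first one.

```python
# Hypothetical stand-in for `data`; the values are made up for illustration.
sample_groups = [
    {"name": "Group A", "city": "New York", "state": "NY", "members": 1200},
    {"name": "Group B", "city": "Brooklyn", "state": "NY", "members": 85},
    {"name": "Group C", "city": "Austin", "state": "TX", "members": 430},
]

def has_consistent_keys(groups):
    """Return True if every dictionary has the same set of keys as the first."""
    if not groups:
        return True
    expected = set(groups[0].keys())
    return all(set(group.keys()) == expected for group in groups)

print(has_consistent_keys(sample_groups))
```

Calling `has_consistent_keys(data)` on the real dataset would be one way to confirm the structure before relying on a key such as `members` in later steps.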

Great! We have an idea of what we're working with now. How can we use this data to answer some of our questions from above? Let's start by finding out which cities have the most Python groups.

We can do this by creating a dictionary and giving it keys for each city in our list of groups that we have imported as the variable, `data`.

Let's write a function that will do this for us and call it `create_cities_dictionary`, which should return a dictionary with keys for each city that point to lists with all the meetup groups in that city.
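As a point of comparison, the same grouping pattern can be sketched with `collections.defaultdict` from the standard library, which creates the empty list for a new key automatically. This is an alternative to an explicit `if`/`else` check, not the lab's required solution; `sample_groups` and `group_by_city` are hypothetical names, and the data is made up.

```python
from collections import defaultdict

# Hypothetical sample data; only the "city" key matters for grouping.
sample_groups = [
    {"name": "Group A", "city": "New York"},
    {"name": "Group B", "city": "Denver"},
    {"name": "Group C", "city": "New York"},
]

def group_by_city(groups):
    """Map each city name to the list of groups located in that city."""
    by_city = defaultdict(list)
    for group in groups:
        # A missing city key is initialized to an empty list on first access.
        by_city[group["city"]].append(group)
    return dict(by_city)  # convert back to a plain dictionary

grouped = group_by_city(sample_groups)
print(list(grouped.keys()))
```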

In [5]:
def create_cities_dictionary(group_data):
    city_dict = {}
    for group in group_data:
        if group["city"] in city_dict:
            city_dict[group["city"]].append(group)
        else: 
            city_dict[group["city"]] = [group]
    return city_dict

cities_dict = create_cities_dictionary(data) # call our create_cities_dictionary function here to assign its return value to the variable cities_dict
cities_dict

{'New York': [{'utc_offset': -14400000,
   'country': 'US',
   'visibility': 'public',
   'city': 'New York',
   'timezone': 'US/Eastern',
   'created': 1150197645000,
   'topics': [{'urlkey': 'opensource', 'name': 'Open Source', 'id': 563},
    {'urlkey': 'python', 'name': 'Python', 'id': 1064},
    {'urlkey': 'softwaredev', 'name': 'Software Development', 'id': 3833},
    {'urlkey': 'django', 'name': 'Django', 'id': 10553},
    {'urlkey': 'web-development', 'name': 'Web Development', 'id': 15582},
    {'urlkey': 'computer-programming',
     'name': 'Computer programming',
     'id': 48471},
    {'urlkey': 'python-web-development',
     'name': 'Python Web Development',
     'id': 917242}],
   'link': 'https://www.meetup.com/nycpython/',
   'rating': 4.64,
   'description': '<p>Meet other local Python Programming Language enthusiasts!</p>',
   'lon': -73.98999786376953,
   'group_photo': {'highres_link': 'https://secure.meetupstatic.com/photos/event/c/a/1/4/highres_313251732.jpeg',
  

Neat! We've started to organize our data and are on our way to extracting some tasty analyses! Let's look first at what the keys are of our newly created dictionary. How can we do this, again?

In [8]:
cities_keys = list(cities_dict.keys()) # assign the keys from the cities_dict to the variable cities_keys
cities_keys

['New York',
 'Bayside',
 'Brooklyn',
 'Yonkers',
 'Forest Hills',
 'San Francisco',
 'Daly City',
 'Berkeley',
 'Oakland',
 'San Mateo',
 'Alameda',
 'Denver',
 'Englewood',
 'Austin',
 'Atlanta']

Looks like we've picked up a couple of extra cities! So far, we know that all of these cities have at least one group that focuses on Python. Let's dig in and figure out how these cities compare.

Write a function that returns a list of values representing the number of groups in each city. Let's call it `tabulate_num_groups`. It should take a dictionary as its parameter, in this case our `cities_dict`, and return a list of the number of groups each city has.

> **Hint:** to make this more programmatic, you might think about using one of the functions we introduced that iterates over a collection, operates on each element and returns a list of the same length.
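To illustrate the hint on toy data: Python's built-in `map` applies a function to each element of a collection. In Python 3 it returns a lazy iterator rather than a list, so wrapping it in `list()` is what produces a list of the same length. The `toy_cities` dictionary below is hypothetical.

```python
# Hypothetical dictionary mapping a city to a list of (placeholder) groups.
toy_cities = {
    "New York": ["group 1", "group 2", "group 3"],
    "Denver": ["group 4"],
}

# map applies len to each list of groups, but does no work until consumed.
lazy_counts = map(len, toy_cities.values())

# list() consumes the iterator and gives one count per city.
counts = list(lazy_counts)
print(counts)  # [3, 1]

# The iterator is now exhausted, so consuming it again yields nothing.
leftover = list(lazy_counts)
print(leftover)  # []
```

Because a `map` object can only be consumed once, converting it to a list right away avoids surprises when the counts are reused later.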

In [429]:
def count_groups(groups):
    return len(groups)

def tabulate_num_groups(dictionary):
    # map returns a lazy iterator in Python 3, so wrap it in list() to return an actual list
    return list(map(count_groups, dictionary.values()))

group_count_list = tabulate_num_groups(cities_dict) # assign your list of group counts to the variable group_count_list

Awesome, we've created a list containing the number of groups in each city. Let's start thinking about how we can visualize this data and begin drawing some conclusions. Since we have lists of data, we should be in good shape to make a bar graph. Let's try it out!

In [431]:
import plotly
from plotly.offline import iplot, init_notebook_mode
from plotly import tools
import plotly.graph_objs as go
init_notebook_mode(connected=True)

trace = {'type': 'bar', 'x': list(cities_dict.keys()), 'y': list(group_count_list)}

plotly.offline.iplot({'data': [trace]})

Quite a close call, but it looks like New York and San Francisco take the top two spots -- who would have thought? 
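A ranking is easier to read off a bar chart when the bars are sorted. Here is a minimal sketch, assuming two parallel lists of city names and counts like the ones passed to the chart; the names `city_names` and `group_counts` and the numbers are made up for illustration.

```python
# Hypothetical parallel lists; the counts are invented.
city_names = ["New York", "Denver", "San Francisco", "Austin"]
group_counts = [12, 4, 11, 6]

# Pair each city with its count, then sort the pairs by count, largest first.
ranked = sorted(zip(city_names, group_counts), key=lambda pair: pair[1], reverse=True)

# Split the sorted pairs back into two parallel lists for charting.
sorted_names = [name for name, _ in ranked]
sorted_counts = [count for _, count in ranked]
print(ranked)
```

Passing `sorted_names` and `sorted_counts` as the `'x'` and `'y'` values of the trace would draw the bars in descending order.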

Okay, this is a start in the right direction, but can't anyone start a Meetup group? What if there are only a few people in these groups? We have to start thinking more critically about our data.

Let's start by whittling down our list of groups so that it only includes groups that have over 300 members -- this should be substantial enough to suggest that the group is active and has a large network. It's a naive approach, but it should give us some better insights into which cities have the most active groups.

We can write a function that will select only the groups that have `members` in excess of 300. Let's call it `get_active_groups`. It should take in our list of groups and return a pared down list with just the groups that have memberships above 300. 

> **Hint:** This is a good opportunity to again make our code more programmatic and use a function that takes in a list and iterates over each element, selecting only the elements that pass our condition.
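To illustrate the hint on toy data: the built-in `filter` keeps only the elements for which a predicate function returns `True`. The sketch below also shows one way to avoid hard-coding the threshold, using a hypothetical helper `has_more_members_than` that builds the predicate; the groups are made up.

```python
# Hypothetical groups with invented member counts.
toy_groups = [
    {"name": "Group A", "members": 1200},
    {"name": "Group B", "members": 300},
    {"name": "Group C", "members": 301},
]

def has_more_members_than(threshold):
    """Build a predicate that checks whether a group exceeds the threshold."""
    def predicate(group):
        return group["members"] > threshold
    return predicate

# filter is lazy in Python 3, so list() is used to materialize the result.
large_groups = list(filter(has_more_members_than(300), toy_groups))
large_names = [group["name"] for group in large_groups]
print(large_names)  # ['Group A', 'Group C']
```

Note that the comparison is strict, so a group with exactly 300 members is excluded.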

In [432]:
def big_groups(group):
    return group['members'] > 300

def get_active_groups(data):
    return list(filter(big_groups, data))

active_groups_list = get_active_groups(data) # assign the list of active groups to the variable active_groups_list

Alright, let's look at how many groups we were able to filter out of our initial list, `data`:

In [433]:
print(len(data) - len(active_groups_list))

Alright, so we removed 63 results from our list. Now we can re-chart our list and see what it looks like. Before we do that, we need to recreate our dictionary from before. Good thing we have our function, `create_cities_dictionary`, which creates the type of dictionary we need! Let's call this new dictionary `active_cities_dict`.

In [434]:
active_cities_dict = create_cities_dictionary(active_groups_list)
print(len(active_cities_dict['San Francisco']))

44


Alright, we've got our updated dictionary and can now think about re-charting our data. We also need to update our number of groups, but we've got that part covered too with our previously written functions. Let's tabulate the group count again and then get our updated chart.

In [435]:
active_group_count_list = list(tabulate_num_groups(active_cities_dict))

In [436]:
import plotly
from plotly.offline import iplot, init_notebook_mode
from plotly import tools
import plotly.graph_objs as go
init_notebook_mode(connected=True)

trace = {'type': 'bar', 'x': list(active_cities_dict.keys()), 'y': active_group_count_list}

plotly.offline.iplot({'data': [trace]})

It's a dead heat! New York, well Manhattan, and San Francisco are tied. So, we've pared down our list and our results are still pretty similar to before, but we have a better idea of which cities have the most active Python communities. 

Let's take our analysis a step further. We have decided data science is the life we want to live and we now want to find where we want to live and work once we've completed this course. Another naive, but useful approach would be to look at where these groups are located. Remember the maps we used earlier in this section? We're going to bring those back!

Below, we have set up the map with a view of the US:

In [440]:
import folium
python_map = folium.Map([40.342140, -99.699511], zoom_start = 4)
python_map

Okay, very cool but boring. Let's start adding markers to our map.
We can make markers by calling `folium.Marker` and passing in a list with the latitude and longitude of the thing we are trying to mark (e.g. `folium.Marker([latitude, longitude])`). Let's write a function that takes in our active groups and adds a marker for each one to our map. This function should be called `add_markers` and take in two parameters: a list of groups and the map to which you would like to add the markers.
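Since a marker's location is a single `[latitude, longitude]` pair, it can help to build and sanity-check those pairs before putting anything on a map. The sketch below is plain Python and does not call folium; `marker_locations` is a hypothetical helper and the coordinates are made up.

```python
# Hypothetical groups; the last one is missing its coordinates.
toy_groups = [
    {"name": "Group A", "lat": 40.75, "lon": -73.99},
    {"name": "Group B", "lat": 37.77, "lon": -122.42},
    {"name": "Group C", "lat": None, "lon": None},
]

def marker_locations(groups):
    """Return [lat, lon] pairs for groups whose coordinates are present and in range."""
    locations = []
    for group in groups:
        lat, lon = group.get("lat"), group.get("lon")
        if lat is None or lon is None:
            continue  # skip groups without coordinates
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            locations.append([lat, lon])
    return locations

locations = marker_locations(toy_groups)
print(locations)
```

Each pair in `locations` has the shape that would be passed as the location of a marker.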

In [459]:
def add_markers(group_list, selected_map):
    for group in group_list:
        folium.Marker([group['lat'], group['lon']]).add_to(selected_map)

add_markers(active_groups_list, python_map)

python_map

Nice! We plotted all the markers on the map. Now we can really visualize where all the Python hubs are. From our chart, we can tell that the action is heavily concentrated in California and New York. So, we can gather that if we want to position ourselves to have a strong network of fellow data scientists and Python lovers, we'll need to consider living in one of these locations. Let's make two maps -- one map for the Python groups in New York and then one for the Python groups in California.
Let's go back to our original list of groups and select only the ones whose state is the state we want. We can again use a function that iterates over a collection and only selects the elements that pass our condition. We will call it `select_groups_from_state`. It should take in a parameter which is the desired state and it should return a list of groups that are located in that state.

> **Hint:** The `state` key points to the initials for that state (e.g. NY for New York & CA for California)

In [449]:
def filter_group_by_state(group, state):
    return group['state'] == state

def select_groups_from_state(state):
    return list(filter(lambda group: filter_group_by_state(group, state), data))
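One design note: a function that reads a global list can only ever filter that one list. A variation that takes the list of groups as a parameter can be reused on any list, such as the active groups. This is a sketch with a hypothetical name, `select_groups_by_state`, and made-up data; it uses a list comprehension, which is equivalent to `filter` with a lambda here.

```python
# Hypothetical groups; only the "state" key matters for this example.
toy_groups = [
    {"name": "Group A", "state": "NY"},
    {"name": "Group B", "state": "CA"},
    {"name": "Group C", "state": "NY"},
]

def select_groups_by_state(groups, state):
    """Return the groups from the given list whose state matches the abbreviation."""
    return [group for group in groups if group["state"] == state]

ny_groups = select_groups_by_state(toy_groups, "NY")
ny_names = [group["name"] for group in ny_groups]
print(ny_names)  # ['Group A', 'Group C']
```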

Okay, this is great. We now have a way to capture only the groups from a specified state. Let's map them now. Below, we've supplied two maps centered over CA and NY. Remember, we've set ourselves up so that we can easily add markers to a map with another function we've written...

In [472]:
CA_map = folium.Map([37.679238, -122.273028], zoom_start = 10)
NY_map = folium.Map([40.677887, -74.026795], zoom_start = 10)

In [473]:
add_markers(select_groups_from_state("CA"), CA_map)
CA_map

In [474]:
add_markers(select_groups_from_state("NY"), NY_map)
NY_map

Whoa! Okay, so we've got another piece of data to add to our analysis. It looks like we can confidently pick either San Francisco or New York to live our new lives as data scientists, with northeastern San Francisco and Manhattan having the most Python groups.

### Summary

In this lab, we used our skills to map, filter, chart, and plot data. We wrote functions to parse through hundreds of Meetup groups to find the ones that gave us the most relevant information. Finally, we consolidated our data and created visualizations to help us make decisions based on this data.