# Loading Data from Files

Practice opening CSV and JSON files, loading them into sensible Python objects with the `csv` and `json` libraries, and running some basic analysis on the data.

## CSV Files

Key data analysis tasks:
1. Counting
2. Measuring central tendency
3. Finding the maximum and minimum

In [1]:
import csv

In [12]:
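with open("apple_orchard.csv", newline="") as f:
    # DictReader turns each row into a dict keyed by the CSV header.
    apple_orchard_data = list(csv.DictReader(f))

# CSV values are read as strings, so convert "yield" to float before doing math.
apple_tree_yields = [float(x["yield"]) for x in apple_orchard_data]
# Counting: this "total" is the number of yield records, not their sum.
print(f"Total apple tree yields: {len(apple_tree_yields)}")
# Central tendency: the arithmetic mean, rounded to 2 decimal places.
print(f"Average apple tree yields: {round(sum(apple_tree_yields) / len(apple_tree_yields), 2)}")
print(f"Maximum apple tree yields: {round(max(apple_tree_yields), 2)}")
print(f"Minimum apple tree yields: {round(min(apple_tree_yields), 2)}")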


Total apple tree yields: 5000
Average apple tree yields: 42.41
Maximum apple tree yields: 65.55
Minimum apple tree yields: 21.93
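
The three tasks listed at the top of this section can also be covered with the standard library's `statistics` module. The sketch below is only an illustration: the inline sample data and the `tree_id` column are made up, and only the `yield` column name comes from the notebook.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for apple_orchard.csv.
# Only the "yield" column name is taken from the notebook; "tree_id" is assumed.
sample_csv = "tree_id,yield\n1,40.0\n2,50.5\n3,30.5\n4,45.0\n"

# io.StringIO lets csv.DictReader read from a string as if it were a file.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
yields = [float(row["yield"]) for row in rows]

summary = {
    "count": len(yields),                        # 1. counting
    "mean": round(statistics.mean(yields), 2),   # 2. central tendency (mean)
    "median": statistics.median(yields),         # 2. central tendency (median)
    "max": max(yields),                          # 3. maximum
    "min": min(yields),                          # 3. minimum
}
print(summary)
```

The median is included because it is less sensitive to a few unusually high or low yields than the mean.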


## JSON Files

JSON stands for JavaScript Object Notation.

The code below loads a dataset of interactions between Twitter users.

Each user is represented as a "node".

When one user tweets at another, the connection is represented as a "link".
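
To make the node and link layout concrete, here is a minimal sketch of what such a file might look like once parsed. The keys `nodes`, `links`, `source`, and `target` match the ones used later in this notebook; the `id` field on each node and the sample values are assumptions for illustration only.

```python
import json

# Hypothetical miniature of a node-link graph file.
# The "id" field on nodes is an assumption; the real dataset may differ.
sample_json = """
{
  "nodes": [{"id": 1}, {"id": 2}, {"id": 3}],
  "links": [
    {"source": 1, "target": 2},
    {"source": 1, "target": 3},
    {"source": 3, "target": 2}
  ]
}
"""

# json.loads parses a string; json.load does the same for an open file.
graph = json.loads(sample_json)

# JSON objects become dicts and JSON arrays become lists.
print(type(graph).__name__, type(graph["links"]).__name__)
print(len(graph["nodes"]), "nodes,", len(graph["links"]), "links")

first_link = graph["links"][0]
print(f"User {first_link['source']} tweeted at user {first_link['target']}")
```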


### Challenge:

Find the users who were:
1. "sources" (the user initiating a tweet that tags someone else) most often
2. "targets" (the user being tagged in a tweet) most often

In [16]:
import json

In [22]:
def build_freq_table(key, links):
    """
    build_freq_table() takes a key and a list of dictionaries, and
    returns a dictionary that maps each value stored under that key
    to the number of times it appears
    """
    table = {}
    for link in links:
        user = link[key]
        table[user] = table.get(user, 0) + 1
    return table


def print_top_5(table):
    """
    print_top_5() takes the sources or targets frequency table, and
    prints the top 5 users, sorted by number of tweets in descending order
    """
    for k, v in sorted(table.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"User {k}\t| {v} Tweets")

with open("twitter_graph.json") as f:
    twitter_data = json.load(f)
    print(f"There are {len(twitter_data['nodes'])} users in this dataset")
    print()

    links = twitter_data["links"]
    sources = build_freq_table("source", links)
    targets = build_freq_table("target", links)

    print("Top 5 Sources:")
    print_top_5(sources)
    print()
    print("Top 5 Targets")
    print_top_5(targets)

There are 99 users in this dataset

Top 5 Sources:
User 232762581	| 23 Tweets
User 49076695	| 20 Tweets
User 523173553	| 19 Tweets
User 24883888	| 17 Tweets
User 53318310	| 16 Tweets

Top 5 Targets
User 169686021	| 13 Tweets
User 23642374	| 12 Tweets
User 25797630	| 11 Tweets
User 25626212	| 11 Tweets
User 21648607	| 10 Tweets
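
As an alternative to a hand-built frequency table, `collections.Counter` from the standard library does the counting and ranking in one step. This is a sketch, not the notebook's method: the helper name `top_users` and the sample links are hypothetical, and the links only mimic the shape of `twitter_data["links"]`.

```python
from collections import Counter

# Hypothetical links in the same shape as twitter_data["links"].
links = [
    {"source": "a", "target": "b"},
    {"source": "a", "target": "c"},
    {"source": "b", "target": "c"},
    {"source": "a", "target": "b"},
]


def top_users(key, links, n=5):
    """Return up to n (user, count) pairs for the given key, most frequent first."""
    counts = Counter(link[key] for link in links)
    return counts.most_common(n)


top_sources = top_users("source", links)
top_targets = top_users("target", links)

for user, count in top_sources:
    print(f"User {user}\t| {count} Tweets")
```

`most_common(n)` returns the `n` highest counts in descending order, which replaces both the manual `dict.get` loop and the `sorted(...)[:5]` slice.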
