# If you're writing a lot of code - you're doing it wrong


### Ryan Kazmerik
* Data Scientist, Encana Corporation
* Sessional Instructor, Mount Royal University
* Mount Royal University, Bachelor CIS (2011)
* Wilfrid Laurier University, Master MAC (2019)

In [2]:
# pip install pandas
# pip install altair
# pip install spacy
# python -m spacy download en_core_web_sm

## Let's start with our first data representation: row-based files

## For this we'll load our data in CSV format and use Python's built-in csv module to read the contents of the file:

In [3]:
import csv

with open('articles.csv',  encoding="utf8") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    
    for row in csv_reader:
        print(", ".join(row), end='\n\n')


id, publishedAt, description, source, title

3528564, 2019-02-23T15:10:34Z, CADILLAC, Mich. (AP) — Consumers Energy is planning to develop a solar power plant in the northwestern Michigan community of Cadillac., Associated Press, Consumers Energy planning solar power plant in Cadillac

3528565, 2018-06-03T18:06:40Z, Can climate change be solved with technologies like wind and solar energy? No, it can’t, according to a new report by two Google engineers, published by the Institute of Electrical and Electronics Engineers., Fox News, Google engineers say renewable energy won’t solve climate change

3528566, 2018-08-09T09:26:02Z, Fox News NASA's Parker Solar Probe set to 'touch the Sun' on historic mission Fox News CAPE CANAVERAL – NASA is set to launch its Parker Solar Probe Saturday on a historic mission that will “touch the Sun.” The solar probe will be the first spacecraft to fly …, Fox News, NASA's Parker Solar Probe set to 'touch the Sun' on historic mission - Fox News

3528567, 2018
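Notice that some descriptions contain commas of their own. A minimal sketch of how csv.reader copes with that, assuming the file quotes such fields; the two-row sample below is hypothetical and only stands in for articles.csv:

```python
import csv
import io

# Hypothetical sample standing in for articles.csv (not the real data)
sample = io.StringIO(
    'id,description,source\n'
    '1,"CADILLAC, Mich. (AP) - a solar plant is planned",Associated Press\n'
)

rows = list(csv.reader(sample))

# The quoted description stays a single field even though it contains commas
print(len(rows[1]))
print(rows[1][1])
```

This is why the csv module is usually a safer choice than splitting each line on commas by hand.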

## CSV is a great storage format: compact and readable, but a little clumsy to work with.

## Let's convert this CSV into our 2nd data structure: List

In [4]:
articles_list = []

with open('articles.csv',  encoding="utf8") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    
    for row in csv_reader:
        articles_list.append(row)
    

## Let's see how many articles are in our list (a.k.a. the length of the list), and how many columns of metadata each article has (a.k.a. the length of the first item):

In [None]:
total_articles = #TODO: GET LENGTH OF THE LIST 
total_columns = #TODO: GET LENGTH OF FIRST ITEM IN LIST

print('Total number of articles:', total_articles) #a.k.a the length of the list

print('Total number of columns:', total_columns) #a.k.a the length of the first object in the list
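One possible way to fill in the two TODOs, sketched with the built-in len function. The three-row sample is hypothetical and stands in for articles.csv; note that the header row is counted as an item too, so the real number of articles is one less than the length of the list.

```python
import csv
import io

# Hypothetical sample standing in for articles.csv
sample = io.StringIO(
    "id,publishedAt,description,source,title\n"
    "1,2019-02-23T15:10:34Z,First description,Associated Press,First title\n"
    "2,2018-06-03T18:06:40Z,Second description,Fox News,Second title\n"
)
articles_list = list(csv.reader(sample))

total_articles = len(articles_list)     # includes the header row
total_columns = len(articles_list[0])   # number of fields in the first row

print('Total number of articles:', total_articles)
print('Total number of columns:', total_columns)
```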

## A list wouldn't be a list if we couldn't iterate through the items.

## Let's use Python's built-in slice notation [0:10] to print the first 10 rows, which is the header row plus the titles of the first 9 articles:

In [5]:
for article in articles_list[0:10]: #not ideal that the first item is always the column name
    
    print(article[4], end="\n\n") 

title

Consumers Energy planning solar power plant in Cadillac

Google engineers say renewable energy won’t solve climate change

NASA's Parker Solar Probe set to 'touch the Sun' on historic mission - Fox News

Weird solar science: How NASA's Parker probe will dive through the Sun's atmosphere

South Australia is giving away Tesla batteries so it can build the world's biggest virtual power plant

Solar-powered quadcopter drone takes flight

California may require solar panels on new homes in 2020

Samsung commits to using only renewable energy by 2020

Tesla shareholders vote on Elon Musk's ambitious pay package
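The header row showing up as the first "article" can be avoided. A small sketch, assuming we are free to change how the list is built: pull the header off the reader with next() before collecting the rows. The sample data and the title_index name are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for articles.csv
sample = io.StringIO(
    "id,title\n"
    "1,First title\n"
    "2,Second title\n"
    "3,Third title\n"
)

csv_reader = csv.reader(sample)
header = next(csv_reader)          # consume the column names first
articles_list = list(csv_reader)   # every remaining row is an article

# Look the column up by name instead of hard-coding its position
title_index = header.index("title")

for article in articles_list[0:2]:
    print(article[title_index])
```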



## With a list, we can easily get some basic stats on the articles and iterate through the items.

## But if we want to add a new property to each item, lists can be difficult to work with, so it's best to convert our list items into our 3rd data structure: Dictionary

## Let's convert the original CSV to a list of dictionaries this time, and print the first 10 articles

In [6]:
with open('articles.csv', encoding="utf8") as csv_file: #the with block closes the file for us
    articles_dict = list(csv.DictReader(csv_file))

for article in articles_dict[0:10]:

    #QUESTION: WHY IS THIS METHOD OF REFERENCING THE COLUMN BETTER THAN THE LIST METHOD?
    print(article['title'], end="\n\n") 

Consumers Energy planning solar power plant in Cadillac

Google engineers say renewable energy won’t solve climate change

NASA's Parker Solar Probe set to 'touch the Sun' on historic mission - Fox News

Weird solar science: How NASA's Parker probe will dive through the Sun's atmosphere

South Australia is giving away Tesla batteries so it can build the world's biggest virtual power plant

Solar-powered quadcopter drone takes flight

California may require solar panels on new homes in 2020

Samsung commits to using only renewable energy by 2020

Tesla shareholders vote on Elon Musk's ambitious pay package

There's new evidence that fossil fuels are getting crushed in the ongoing energy battle against renewables



## Now let's add our new property to each article (word count of article)

In [9]:
for article in articles_dict:
    
    words = # TODO: create a list of words using the split() function
    word_count = # TODO: get the length of the words list
    
    article['word_count'] = word_count
    
print(articles_dict[0]) #show the first article with its new word_count property

SyntaxError: invalid syntax (<ipython-input-9-0abb4292c35b>, line 3)
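The SyntaxError above comes from the unfinished TODO lines. One possible way to complete them, assuming the word count is taken from the description field (the exercise does not say which field to use); the two sample articles are hypothetical:

```python
# Hypothetical sample standing in for the rows read from articles.csv
articles_dict = [
    {"description": "Solar-powered quadcopter drone takes flight"},
    {"description": "Samsung commits to using only renewable energy by 2020"},
]

for article in articles_dict:
    words = article["description"].split()   # split on any whitespace
    word_count = len(words)                   # number of words found

    article["word_count"] = word_count

print(articles_dict[0])
```

Splitting on whitespace is a rough measure: hyphenated words and trailing punctuation each count as part of a single word.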

## Working with Dictionaries can accomplish this task, but compiling aggregations (groupings) is a bit tricky.
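To see why, here is roughly what counting articles per source looks like with plain dictionaries. This is only a sketch on made-up sample data; the counts and top_sources names are illustrative.

```python
# Hypothetical sample standing in for the rows read from articles.csv
articles_dict = [
    {"source": "Reuters"},
    {"source": "Fox News"},
    {"source": "Reuters"},
]

# Tally the articles per source by hand
counts = {}
for article in articles_dict:
    source = article["source"]
    counts[source] = counts.get(source, 0) + 1

# Sort the (source, count) pairs by count, largest first, and keep the top 10
top_sources = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]

for source, num_articles in top_sources:
    print(source, num_articles)
```

It works, but every new grouping or statistic means another hand-written loop.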

## Let's convert our dataset into our 4th data structure : Data Frame 

## We can use the popular library Pandas to convert straight from our Dictionary to a DataFrame:

In [11]:
import pandas as pd

df = pd.DataFrame(articles_dict) #a list of dictionaries converts directly: one row per dictionary

df.info() #displays some basic info on the dataframe
df.head() #prints out the first 5 items

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1316 entries, 0 to 1315
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           1316 non-null   object
 1   publishedAt  1316 non-null   object
 2   description  1316 non-null   object
 3   source       1316 non-null   object
 4   title        1316 non-null   object
dtypes: object(5)
memory usage: 51.5+ KB


|   | id | publishedAt | description | source | title |
|---|---|---|---|---|---|
| 0 | 3528564 | 2019-02-23T15:10:34Z | CADILLAC, Mich. (AP) — Consumers Energy is pla... | Associated Press | Consumers Energy planning solar power plant in... |
| 1 | 3528565 | 2018-06-03T18:06:40Z | Can climate change be solved with technologies... | Fox News | Google engineers say renewable energy won’t so... |
| 2 | 3528566 | 2018-08-09T09:26:02Z | Fox News NASA's Parker Solar Probe set to 'tou... | Fox News | NASA's Parker Solar Probe set to 'touch the Su... |
| 3 | 3528567 | 2018-08-10T11:00:00Z | NASA’s Parker Solar Probe will soon blast off ... | Fox News | Weird solar science: How NASA's Parker probe w... |
| 4 | 3528568 | 2018-02-06T16:08:15Z | Mark Brake / Getty Images The South Australian... | Business Insider | South Australia is giving away Tesla batteries... |
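The info() output shows every column stored as object (plain strings), because the CSV reader hands everything over as text. A short sketch of converting two of them, assuming id holds whole numbers and publishedAt holds ISO 8601 timestamps like the rows shown; the two-row frame is a stand-in for the real df:

```python
import pandas as pd

# Two-row stand-in for the real DataFrame
df = pd.DataFrame([
    {"id": "3528564", "publishedAt": "2019-02-23T15:10:34Z"},
    {"id": "3528565", "publishedAt": "2018-06-03T18:06:40Z"},
])

df["id"] = df["id"].astype(int)                         # text -> integer
df["publishedAt"] = pd.to_datetime(df["publishedAt"])   # text -> timestamp

print(df.dtypes)
print(df["publishedAt"].dt.year.tolist())
```

With real timestamps in place, date parts such as the year or month become available for grouping through the .dt accessor.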


## Now we can easily produce some aggregations like: top 10 sources

In [15]:
df_sources = df.groupby(['source']).agg({ #QUESTION: WHY ARE WE COPYING INTO A NEW DATA FRAME?
    'id': 'count'
}).reset_index()

df_sources.sort_values(by=['id'], inplace=True, ascending=False)

df_sources.columns = ['source', 'num_articles']

df_sources.head(10)

|   | source | num_articles |
|---|---|---|
| 15 | Google News | 344 |
| 23 | Reuters | 226 |
| 2 | Associated Press | 111 |
| 4 | Bloomberg | 78 |
| 27 | The Guardian (AU) | 72 |
| 6 | Business Insider | 61 |
| 8 | CNBC | 59 |
| 31 | The Wall Street Journal | 26 |
| 16 | Independent | 25 |
| 29 | The New York Times | 25 |
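In the spirit of writing less code, the same kind of count can be sketched with value_counts, which groups, counts, and sorts in one call. The four-row frame is hypothetical; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "source": ["Reuters", "Fox News", "Reuters", "CNBC"],
})

df_sources = (
    df["source"]
    .value_counts()                     # count per source, largest first
    .rename_axis("source")              # name the index so it becomes a column
    .reset_index(name="num_articles")   # back to a two-column DataFrame
)

print(df_sources.head(10))
```

One difference from the groupby version: the index is renumbered from 0 in sorted order instead of keeping the original group positions.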


## It might be nice to see this data visualized, and luckily a Data Frame plugs nicely into many Python viz libraries.

## We'll use one called Altair to produce a horizontal bar chart of our data:

In [22]:
import altair as alt

alt.Chart(df_sources.head(10)).mark_bar().encode(
    x='num_articles',
    y='source'
).properties(
    title='Number of Articles by Source',
    width=750
)

## This is interesting but doesn't really tell us much about what the articles are talking about.

## Let's apply some data science using Natural Language Processing to help us to extract some more meaningful metadata from the articles.

## We can use a popular NLP library called spaCy to do the text processing:

https://explosion.ai/demos/displacy-ent

In [27]:
import spacy as sp
nlp = sp.load("en_core_web_sm")

In [86]:
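def extract_locations(text):
    """Return the first LOC entity spaCy finds in text, or None if there is none."""

    doc = nlp(text)

    # "LOC" covers non-political locations (mountain ranges, bodies of water);
    # countries, cities and states are labelled "GPE" instead
    ents = [e.text for e in doc.ents if e.label_ == "LOC"]

    # keep only the first match so each article maps to a single value
    return ents[0] if ents else None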

In [87]:
df['locations'] = df['description'].map(extract_locations)

## Let's aggregate and visualize our new locations column to show the top 10 locations mentioned in the articles:

In [89]:
df_locations = df.groupby(['locations']).agg({
    'id': 'count'
}).reset_index()

df_locations.sort_values(by=['id'], inplace=True, ascending=False)

df_locations.columns = ['location', 'num_articles']

df_locations.head(10)


In [90]:
alt.Chart(df_locations.head(10)).mark_bar().encode(
    x='num_articles',
    y='location'
).properties(
    title='Number of Articles by Location',
    width=750
)

In [85]:
#TODO: CREATE ANOTHER COLUMN IN OUR df DATAFRAME TO EXTRACT THE ORGANIZATIONS (ORG) FROM THE ARTICLES

#TODO: CREATE A NEW DATAFRAME CALLED df_organizations AND GROUP BY THE ORGANIZATION

#TODO: CREATE A NEW BAR CHART TO VISUALIZE THE ORGANIZATION DATA
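As a starting point for the first TODO, the location helper can be generalized to any entity label. This is only a sketch: first_entity is a hypothetical name, and the stand-in document below is built with SimpleNamespace so the example runs without downloading a model. In the notebook you would pass nlp(text) instead.

```python
from types import SimpleNamespace


def first_entity(doc, label):
    """Return the text of the first entity in doc with the given label, else None."""
    for ent in doc.ents:
        if ent.label_ == label:
            return ent.text
    return None


# Stand-in for a spaCy Doc (hypothetical entities, not real model output)
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="NASA", label_="ORG"),
    SimpleNamespace(text="Cape Canaveral", label_="GPE"),
])

print(first_entity(fake_doc, "ORG"))
print(first_entity(fake_doc, "LOC"))
```

With the real pipeline loaded, the new column could then be filled with something like df['organizations'] = df['description'].map(lambda text: first_entity(nlp(text), "ORG")), and the groupby and chart cells above can be reused with the column names swapped.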