<center>

<img src="https://raw.githubusercontent.com/pyladies-bcn/pyladies_latex_template/master/pyladies.png" WIDTH=600> 

<h1>
WORKSHOP<br>
"Python for Journalists"<br>
</h1>

<dl>
<dt><br></dt>
<dt>Cristina Ramón @crisodisy</dt>
<dt>Núria Pujol @llevaNEUS</dt>
<dt>Laura Pérez @lpmayos</dt>
</dl> 

</center>

In this session we are going to teach you how to use Python through examples that solve representative everyday data management tasks for journalists:  
  * Obtain statistics and useful information from data in .CSV files.  
  * An introduction to obtaining information from Twitter (if we have enough time).

##0. Download all required files

All required files are in the PyLadiesBCN GitHub repository and can be downloaded easily:  
  * Go to https://github.com/pyladies-bcn/python_for_journalists_j  
  * Click the 'Download Zip' button to download it.
  * Unzip the files on your computer.
  * Open IPython Notebook in the same directory.

# "TARJETAS BLACK"

##1. Obtaining and Cleaning data

Data is usually messy and not ready to be analysed. We will need to “clean” it before we can start working with it.

For this exercise we are going to use a real dataset: "Tarjetas Black" from Caja Madrid. You can obtain this information from different sites. For example:
* http://www.cuartopoder.es/multimedia/2014/10/11/gastos-de-los-exdirectivos-de-caja-madrid-uno-a-uno-con-las-tarjetas-negras-tabla/3403  
* https://github.com/splatsh/tarjetasblack

###1.1 Check that we have all necessary files in current directory

First of all, we have to make sure that all the files we downloaded from the PyLadiesBCN repository are in the working directory and that all data files are available.
In this exercise we are going to work with the data stored in **"tarjetas.csv"**.

In [1]:
import os
print(os.getcwd())
print(os.listdir('.'))
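
Instead of reading the directory listing by eye, we can test for the file explicitly. This is only a minimal sketch: the helper name `missing_files` is hypothetical, and `tarjetas.csv` is the only file name taken from this workshop.

```python
import os

def missing_files(required, directory="."):
    """Return the names in `required` that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in required if name not in present]

# "tarjetas.csv" is the data file used in this exercise.
missing = missing_files(["tarjetas.csv"])
if missing:
    print("Missing files: " + ", ".join(missing))
else:
    print("All required files are available.")
```

If the message lists a missing file, check that the notebook was opened in the directory where the zip was extracted.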

We have to import the pandas library and set the default character encoding to be able to work with the data in this file.

Pandas is a Python module specialized in working with data structures (.csv files, databases, etc.) that provides statistics and plotting functionality. 

In [2]:
import pandas as pd 
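import sys
# Python 2 only: make UTF-8 the default encoding. Python 3 strings are already Unicode.
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')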

###1.2 Load the .csv file into a DataFrame

In [3]:
df = pd.read_csv("tarjetas.csv")
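
`read_csv` accepts options that can save cleaning work later. The sketch below uses a small in-memory sample instead of the real file. The column names `nombre`, `fecha` and `importe` are the ones used later in this notebook, but the rows are invented for illustration, and the UTF-8 encoding of the real file is an assumption.

```python
import io
import pandas as pd

# Invented sample rows, only to illustrate the call; the real data is in tarjetas.csv.
sample = io.StringIO(
    u"nombre,fecha,importe\n"
    u"Persona A,2003-05-08,120.50\n"
    u"Persona B,2003-05-09,80.00\n"
)

# parse_dates converts the "fecha" column from text to datetime values.
sample_df = pd.read_csv(sample, parse_dates=["fecha"])
print(sample_df.dtypes)

# With the real file, the encoding can be passed explicitly (assuming it is UTF-8):
# df = pd.read_csv("tarjetas.csv", encoding="utf-8", parse_dates=["fecha"])
```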

In [4]:
df.shape

In [5]:
df.columns

Now we can take a look at the data obtained from our .csv file.

In [6]:
df.head()

We can also obtain summary statistics for the numeric columns.

In [7]:
df.describe()

###1.3 Select the data we are interested in

In [8]:
df[df["actividad"] == "COCHE"]

In [9]:
df[df["fecha"] == "2003-05-08"]

In [10]:
df[df["importe"] > 10000]

In [11]:
# Rows 30 to 35 and the first five columns (.ix was removed from recent pandas versions)
df.iloc[30:36, 0:5]

In [12]:
df["importe"][0:5]
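
The filters above can also be combined. The following sketch works on a tiny invented table so it can run on its own. Only the column names and the value `"COCHE"` come from this notebook; the other rows are made up.

```python
import pandas as pd

# Invented rows for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "actividad": ["COCHE", "HOTEL", "COCHE", "RESTAURANTE"],
    "importe": [50.0, 300.0, 1200.0, 75.0],
})

# Each condition goes in parentheses; & means "and", | means "or".
coche_caro = toy[(toy["actividad"] == "COCHE") & (toy["importe"] > 100)]
print(coche_caro)

# .loc selects rows by condition and columns by name in one step.
importes_hotel = toy.loc[toy["actividad"] == "HOTEL", "importe"]
print(importes_hotel)
```

The parentheses matter: without them Python evaluates `&` before `==` and the expression fails.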

###1.4 Create a new column

Let's convert the amounts from euros to US dollars, assuming an exchange rate of 1.13 dollars per euro.

In [13]:
df["dolar"] = df["importe"]*1.13

In [14]:
df.head()

###1.5 Obtain information by grouping

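How many purchases did each person make with a "black card"? Counting the rows per name answers this, and the number of names in the result is the number of cardholders.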

In [15]:
df["nombre"].groupby(df.nombre).count()

Let's save it in a dataframe 

In [16]:
df2 = pd.Series(df["nombre"].groupby(df.nombre).count())

In [17]:
df2 = pd.DataFrame(df2)

In [18]:
df2["media"] = df["importe"].groupby(df.nombre).mean()

In [19]:
df2.head()

Rename the columns:

In [20]:
df2.columns = ["recuento", "media"]

In [21]:
df2.head()

In [22]:
df2["maximo"] = df["importe"].groupby(df.nombre).max()

In [23]:
df2["total"] = df["importe"].groupby(df.nombre).sum()

In [24]:
df2.head(20)

In [25]:
df2["total"].sum()

In [26]:
df2["porcentaje"] = (df2["total"] / df2["total"].sum())*100

In [27]:
df2.head()

We can sort the information, for example by total.

In [28]:
df3 = df2.sort_values("total", ascending=False)

In [29]:
df3.head(20)
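
The summary built step by step in this section can also be computed in a single `groupby` call. This is a sketch on invented rows, not the workshop data. The output column names mirror the ones used above.

```python
import pandas as pd

# Invented rows for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "nombre": ["Persona A", "Persona B", "Persona A", "Persona A"],
    "importe": [100.0, 100.0, 50.0, 150.0],
})

# agg() applies several aggregations at once, one output column per function.
resumen = toy.groupby("nombre")["importe"].agg(["count", "mean", "max", "sum"])
resumen.columns = ["recuento", "media", "maximo", "total"]

# Share of the overall spending, then sort from highest to lowest total.
resumen["porcentaje"] = resumen["total"] / resumen["total"].sum() * 100
resumen = resumen.sort_values("total", ascending=False)
print(resumen)
```

The step-by-step version is easier to follow when learning; the `agg` version is shorter once the idea is clear.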

###1.6 Plot the data (quick chart!)

In [30]:
%matplotlib inline
import matplotlib.pyplot as plt

In [31]:
df3["total"][0:10].plot(kind="bar")

###1.7 Another method for grouping

In [32]:
import collections

In [33]:
recuento = collections.Counter(df.nombre)

In [34]:
recuento

A `Counter` is a kind of dictionary, so it has keys and values.

In [35]:
recuento.keys()

In [36]:
recuento["Miguel Blesa de la Parra"]

In [37]:
recuento["Rodrigo de Rato Figaredo"]
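
A `Counter` also has a few conveniences beyond a plain dictionary. The sketch below uses invented names so that it runs on its own. In the notebook the input would be `df.nombre`.

```python
import collections

# Invented names for illustration; in the notebook the input is df.nombre.
nombres = ["Persona A", "Persona B", "Persona A",
           "Persona C", "Persona A", "Persona B"]
recuento_toy = collections.Counter(nombres)

# most_common(n) returns the n most frequent items as (item, count) pairs.
print(recuento_toy.most_common(2))

# A key that was never counted returns 0 instead of raising KeyError.
print(recuento_toy["Persona Z"])
```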

Let's try with a random person like **Rodrigo Rato**

In [38]:
rato = df[df["nombre"] == "Rodrigo de Rato Figaredo"]

In [39]:
rato.shape

In [40]:
rato.head()

In [41]:
rato["importe"].sum()

Rato spent almost 100,000 euros on more than 500 purchases.

In [42]:
rato["importe"].groupby(rato.actividad).count()

##2. Basic Twitter API

We are going to use Python to access the Twitter API. You will need a Twitter account and must register an app on the Twitter website: https://apps.twitter.com/. With Python we will read and save our timeline, and then search for a keyword and save the results. 

We have to follow these simple steps:  
    
* Go to apps.twitter.com
* Log in (you need to have a twitter account)
* Create a new app
* Get consumer key and consumer secret
* Get access token and access token secret

Twitter limits search to 180 queries per 15-minute window at the time of writing, and this limit may change over time.
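
To get a feeling for what this limit means in practice, we can work out the minimum pause between queries. This is a rough sketch using only the numbers quoted above; the helper names `min_delay` and `throttled` are hypothetical and not part of tweepy.

```python
import time

QUERIES_PER_WINDOW = 180      # limit quoted above
WINDOW_SECONDS = 15 * 60      # 15-minute window

def min_delay(queries=QUERIES_PER_WINDOW, window=WINDOW_SECONDS):
    """Seconds to wait between queries so the limit is never exceeded."""
    return float(window) / queries

def throttled(items, delay):
    """Yield items one by one, sleeping `delay` seconds between them."""
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay)
        yield item

print(min_delay())
```

With 180 queries in 900 seconds, that is one query every 5 seconds. tweepy can also do the waiting itself through the `wait_on_rate_limit` argument of `tweepy.API`; check the documentation of your installed version.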

###2.1 Read our timeline and save results in a dataframe

Import all necessary Python modules:

In [43]:
import tweepy
import json
import pandas as pd
import csv

Assign to the following variables the key values you obtained after registering as a developer on Twitter.

In [44]:
consumer_key = ""         # <= your Consumer Key HERE
consumer_secret = ""      # <= your Consumer Secret HERE
access_token = ""         # <= your Access Token HERE
access_token_secret = ""  # <= your Access Token Secret HERE
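
If you plan to share the notebook, it is safer not to type the keys into it. One possible approach, sketched here, is to read them from environment variables. The variable names and the helper `load_credentials` are hypothetical; use whatever names you export on your own machine.

```python
import os

# Hypothetical variable names; choose any names you like when exporting them.
NAMES = ["TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
         "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET"]

def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in NAMES}

# Usage, once the variables are set in your shell:
# credentials = load_credentials()
# consumer_key = credentials["TWITTER_CONSUMER_KEY"]
```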

In [45]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [46]:
api = tweepy.API(auth)

Getting the last 10 tweets in your timeline.

In [47]:
timeline = api.home_timeline(count = 10)

In [48]:
usuarios = []
screen_name = []
texto = []
retweet_count = []

In [49]:
for tweet in timeline:
    usuarios.append(tweet.user.name)
    screen_name.append(tweet.user.screen_name)
    texto.append(tweet.text)
    retweet_count.append(tweet.retweet_count)

With this information we create a DataFrame.
We have already worked with this data structure in the previous exercises.

In [50]:
# Build the DataFrame directly from the four lists, one column per list.
df = pd.DataFrame(
    {"user": usuarios, "screen_name": screen_name, "rt": retweet_count, "text": texto},
    columns=["user", "screen_name", "rt", "text"])
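
An alternative to keeping four parallel lists is to build one dictionary per tweet and pass the list of dictionaries to pandas. The sketch below uses stand-in objects, because real tweets need valid credentials. The `User` and `Status` tuples are hypothetical and only mimic the attributes read in the loop above.

```python
import collections
import pandas as pd

# Stand-ins that mimic the attributes read from the tweepy objects above.
User = collections.namedtuple("User", ["name", "screen_name"])
Status = collections.namedtuple("Status", ["user", "text", "retweet_count"])

fake_timeline = [
    Status(User("Persona A", "persona_a"), "Hola", 2),
    Status(User("Persona B", "persona_b"), "Adios", 0),
]

# One dictionary per tweet; the keys become the column names.
rows = [
    {"user": s.user.name, "screen_name": s.user.screen_name,
     "rt": s.retweet_count, "text": s.text}
    for s in fake_timeline
]
timeline_df = pd.DataFrame(rows, columns=["user", "screen_name", "rt", "text"])
print(timeline_df)
```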

In [51]:
df

In [52]:
# to_csv accepts a file name directly, so there is no need to open and close the file.
df.to_csv("mi_timeline.csv", sep=",", encoding="utf-8")
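
By default `to_csv` also writes the row numbers as an extra first column. If you only want the data columns, `index=False` leaves them out. A small sketch with invented rows and a temporary file (the file name `ejemplo.csv` is hypothetical):

```python
import os
import tempfile
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({"user": ["Persona A", "Persona B"], "rt": [2, 0]},
                   columns=["user", "rt"])

path = os.path.join(tempfile.mkdtemp(), "ejemplo.csv")
# index=False leaves out the row numbers, so the file holds only the data columns.
toy.to_csv(path, sep=",", encoding="utf-8", index=False)

# Reading the file back is a quick check that it was saved as expected.
releido = pd.read_csv(path, encoding="utf-8")
print(releido)
```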

###2.2 Search on twitter and save results in a dataframe

In [53]:
busqueda_usuario=[]
busqueda_texto=[]
busqueda_created_at = []

In [54]:
for tweet in tweepy.Cursor(api.search,
                           q="Colau",
                           rpp=100,
                           result_type="recent",
                           include_entities=True,
                           lang= "en",
                           since='2015-04-01',
                           until='2015-06-01').items(100):
    busqueda_created_at.append(tweet.created_at)
    busqueda_usuario.append(tweet.user.name)    
    busqueda_texto.append(tweet.text)

In [55]:
len(busqueda_usuario)

In [56]:
# Build the DataFrame directly from the three lists, one column per list.
df2 = pd.DataFrame(
    {"fecha": busqueda_created_at, "usuario": busqueda_usuario, "texto": busqueda_texto},
    columns=["fecha", "usuario", "texto"])

In [57]:
df2

In [60]:
# to_csv accepts a file name directly, so there is no need to open and close the file.
df2.to_csv("mi_busqueda.csv", sep=",", encoding="utf-8")