
## SCI 498 - 410: Online Social Network Analysis
### Lecture 2
### Aron Culotta
### Illinois Institute of Technology  




### Last Time:
- configure project
- run `osna web`
- apply for Twitter credentials

### Today:
- json
- pandas
- discuss data for each project
- work on reading your data


### JSON

Let's practice reading and writing [JSON](https://en.wikipedia.org/wiki/JSON) files.

In [56]:
d = {'key1': 'value1',
     'key2': 'value2'}
d['key1']
#d['key3']
d

{'key1': 'value1', 'key2': 'value2'}
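
The commented-out `d['key3']` above would raise a `KeyError`, because `key3` is not in the dictionary. A minimal sketch of two common ways to handle a key that may be missing (the default string `'missing'` is just an illustrative choice):

```python
d = {'key1': 'value1',
     'key2': 'value2'}

# Indexing a missing key raises KeyError.
try:
    d['key3']
except KeyError as e:
    print('KeyError:', e)

# dict.get returns a default (None unless one is given) instead of raising.
print(d.get('key3'))
print(d.get('key3', 'missing'))

# The `in` operator tests for a key without reading its value.
print('key1' in d, 'key3' in d)
```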

In [57]:
import json

# create a list of dictionary objects.
dicts = [{'name': 'joe',
          'age': 21,
          'fav_colors': ['red', 'orange']},
         {'name': 'jane',
          'age': 24,
          'fav_colors': ['red', 'black']}]

# open a file handle to write to.
outf = open('test.json', 'wt')  # 'wt' means: open for writing ('w') in text mode ('t').

# write to the file, one json object per line.
for d in dicts:
    outf.write(json.dumps(d))  # json.dumps converts a python dict into a json string.
    outf.write('\n')

outf.close()

What does `test.json` contain now?

In [58]:
print(open('test.json').read())

{"name": "joe", "age": 21, "fav_colors": ["red", "orange"]}
{"name": "jane", "age": 24, "fav_colors": ["red", "black"]}



Now, let's read it back in.


In [59]:
dicts2 = []
for line in open('test.json'):
    dicts2.append(json.loads(line))
print(dicts2)

[{'name': 'joe', 'age': 21, 'fav_colors': ['red', 'orange']}, {'name': 'jane', 'age': 24, 'fav_colors': ['red', 'black']}]
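
The write and read steps above can also be wrapped in two small helpers that use `with` blocks, so the file is closed even if an error occurs. This is only a sketch: the helper names `write_json_lines` and `read_json_lines` and the file name `people.json` are made up for illustration, and a temporary directory is used so `test.json` is left untouched.

```python
import json
import os
import tempfile

def write_json_lines(dicts, path):
    """Write one JSON object per line; 'wt' opens the file for writing text."""
    with open(path, 'wt') as outf:
        for d in dicts:
            outf.write(json.dumps(d) + '\n')

def read_json_lines(path):
    """Read a file with one JSON object per line into a list of dicts."""
    with open(path, 'rt') as inf:
        # Skip blank lines, which json.loads would reject.
        return [json.loads(line) for line in inf if line.strip()]

people = [{'name': 'joe', 'age': 21, 'fav_colors': ['red', 'orange']},
          {'name': 'jane', 'age': 24, 'fav_colors': ['red', 'black']}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'people.json')
    write_json_lines(people, path)
    people2 = read_json_lines(path)

# The round trip should give back equal dictionaries.
print(people2 == people)
```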


In [60]:
dd = {"key": "value"}
dd["key"]

'value'

In [67]:
# We can now compute some statistics from the dicts2 object.
print('found %d objects' % len(dicts2))

found 2 objects


In [68]:
def mean_age(dicts):
    ages = []
    for d in dicts:
        ages.append(d['age'])
    return sum(ages) / len(dicts)

mean_age(dicts2)

22.5

In [71]:
# how frequently is each color mentioned?
from collections import Counter
def fav_colors(dicts):
    counts = Counter() # handy object: dict from object -> int
    for d in dicts:
        counts.update(d['fav_colors'])
    return counts

color_counts = fav_colors(dicts2)
print(color_counts)
print(color_counts['red'])

Counter({'red': 2, 'orange': 1, 'black': 1})
2


In [75]:
print(color_counts.most_common(5))

[('red', 2), ('orange', 1), ('black', 1)]


### Pandas

[Pandas](https://pandas.pydata.org/) is a commonly used library for dealing with tabular data in Python.

In [76]:
import pandas as pd
df = pd.DataFrame(dicts)
df

|   | age | fav_colors | name |
|---|-----|------------|------|
| 0 | 21 | [red, orange] | joe |
| 1 | 24 | [red, black] | jane |


In [82]:
df.age.mean()

22.5

In [84]:
dir(df.age)

['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_prepare__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 ...

In [85]:
df['height'] = [6.1, 5.8]
df

|   | age | fav_colors | name | height |
|---|-----|------------|------|--------|
| 0 | 21 | [red, orange] | joe | 6.1 |
| 1 | 24 | [red, black] | jane | 5.8 |


In [86]:
df.describe()

|       | age | height |
|-------|-----|--------|
| count | 2.0 | 2.0 |
| mean | 22.5 | 5.95 |
| std | 2.12132 | 0.212132 |
| min | 21.0 | 5.8 |
| 25% | 21.75 | 5.875 |
| 50% | 22.5 | 5.95 |
| 75% | 23.25 | 6.025 |
| max | 24.0 | 6.1 |


In [89]:
# pandas allows complex queries
df[(df.age < 22) & (df.height > 6)]

|   | age | fav_colors | name | height |
|---|-----|------------|------|--------|
| 0 | 21 | [red, orange] | joe | 6.1 |


In [24]:
# pandas can read and write csv files.
df.to_csv('test.csv')
print(open('test.csv').read())

,age,fav_colors,name,height
0,21,"['red', 'orange']",joe,6.1
1,24,"['red', 'black']",jane,5.8



In [27]:
pd.read_csv('test.csv')

|   | Unnamed: 0 | age | fav_colors | name | height |
|---|------------|-----|------------|------|--------|
| 0 | 0 | 21 | ['red', 'orange'] | joe | 6.1 |
| 1 | 1 | 24 | ['red', 'black'] | jane | 5.8 |


In [28]:
# let's get rid of that first column, which is the index.
df.to_csv('test.csv', index=False)
print(open('test.csv').read())

age,fav_colors,name,height
21,"['red', 'orange']",joe,6.1
24,"['red', 'black']",jane,5.8



In [30]:
df2 = pd.read_csv('test.csv')
df2

|   | age | fav_colors | name | height |
|---|-----|------------|------|--------|
| 0 | 21 | ['red', 'orange'] | joe | 6.1 |
| 1 | 24 | ['red', 'black'] | jane | 5.8 |
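
In the output above, the `fav_colors` values now carry quotes (`['red', 'orange']`), which suggests that each list was written to the CSV as its string representation and comes back as a string rather than a list. Here is a minimal sketch of recovering the lists with `ast.literal_eval`. It uses the standard library `csv` module on the same text so that it runs without pandas; the variable names are illustrative.

```python
import ast
import csv
import io

csv_text = '''age,fav_colors,name,height
21,"['red', 'orange']",joe,6.1
24,"['red', 'black']",jane,5.8
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Every CSV field is read as a string, including the list.
print(type(rows[0]['fav_colors']))

for row in rows:
    # literal_eval safely parses Python literals (lists, numbers, strings).
    row['fav_colors'] = ast.literal_eval(row['fav_colors'])
    row['age'] = int(row['age'])
    row['height'] = float(row['height'])

print(rows[0]['fav_colors'])
```

With a DataFrame, the same idea would presumably look like `df2['fav_colors'] = df2.fav_colors.apply(ast.literal_eval)`.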


In [46]:
# we can also read/write compressed files using gzip
import gzip
df.to_csv('test.csv.gz', index=False)        # by putting .gz suffix, pandas knows to compress
print(gzip.open('test.csv.gz', 'rt').read()) # need to specify that we want to read it as Text 

age,fav_colors,name,height
21,"['red', 'orange']",joe,6.1
24,"['red', 'black']",jane,5.8
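
The same `'wt'` / `'rt'` modes work when using `gzip.open` directly, for example to store one JSON object per line in a compressed file. A small sketch, assuming a made-up file name `records.json.gz` written to a temporary directory:

```python
import gzip
import json
import os
import tempfile

records = [{'name': 'joe', 'age': 21}, {'name': 'jane', 'age': 24}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'records.json.gz')
    # 'wt' = write text: strings are encoded and compressed for us.
    with gzip.open(path, 'wt') as outf:
        for r in records:
            outf.write(json.dumps(r) + '\n')
    # 'rt' = read text: each line comes back as a str.
    with gzip.open(path, 'rt') as inf:
        records2 = [json.loads(line) for line in inf]
    # In binary mode ('rb', the default for gzip.open) we get bytes instead.
    with gzip.open(path, 'rb') as inf:
        first_bytes = inf.readline()

print(records2 == records)
print(type(first_bytes))
```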



In [93]:
pd.read_csv('test.csv.gz')

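|   | age | fav_colors | name | height |
|---|-----|------------|------|--------|
| 0 | 21 | ['red', 'orange'] | joe | 6.1 |
| 1 | 24 | ['red', 'black'] | jane | 5.8 |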

In [95]:
df3 = pd.read_csv('test.csv')
df3.age.mean()

22.5

### Tokenization

We will be dealing with a lot of text data (e.g., tweets). Before a machine learning algorithm can use a tweet, we need to preprocess it by splitting it into words (tokens).

In [32]:
tweet = 'Hi @justinbieber this is fun #yes http://not.a.url.com Right???'
tweet

'Hi @justinbieber this is fun #yes http://not.a.url.com Right???'

In [96]:
def simple_tokenizer(s):
    return s.split()

simple_tokenizer(tweet)

['Hi',
 '@justinbieber',
 'this',
 'is',
 'fun',
 '#yes',
 'http://not.a.url.com',
 'Right???']

In [98]:
simple_tokenizer('hi    there     .\n\nyou')

['hi', 'there', '.', 'you']

In [100]:
import re
re.sub('aaa', 'bbb', 'aaaaais a word')

'bbbaais a word'

In [99]:


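# What to do with punctuation?
# Could be relevant, but let's ignore for now.
import re  # Regular Expressions!
def re_tokenizer(s):
    # \w is roughly [a-zA-Z0-9_] (plus Unicode letters and digits in Python 3);
    # \W+ matches a run of anything else, which we replace with a single space.
    return re.sub(r'\W+', ' ', s.lower()).split()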

re_tokenizer(tweet)

['hi',
 'justinbieber',
 'this',
 'is',
 'fun',
 'yes',
 'http',
 'not',
 'a',
 'url',
 'com',
 'right']

In [40]:




# But, we probably want to keep urls together.
# ...and keep mentions and hashtags.
def tweet_tokenizer(s):
    s = re.sub(r'#(\S+)', r'HASHTAG_\1', s)
    s = re.sub(r'@(\S+)', r'MENTION_\1', s)    
    s = re.sub(r'http\S+', 'THIS_IS_A_URL', s)
    return re.sub(r'\W+', ' ', s.lower()).split()

tweet_tokenizer(tweet)

['hi',
 'mention_justinbieber',
 'this',
 'is',
 'fun',
 'hashtag_yes',
 'this_is_a_url',
 'right']
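
Once tweets are tokenized, the `Counter` from the JSON section gives a simple way to count how often each token appears across several tweets. A minimal sketch: the tokenizer is repeated from above so the block runs on its own, and the second tweet is invented for illustration.

```python
import re
from collections import Counter

def tweet_tokenizer(s):
    s = re.sub(r'#(\S+)', r'HASHTAG_\1', s)     # mark hashtags
    s = re.sub(r'@(\S+)', r'MENTION_\1', s)     # mark mentions
    s = re.sub(r'http\S+', 'THIS_IS_A_URL', s)  # collapse urls into one token
    return re.sub(r'\W+', ' ', s.lower()).split()

tweets = ['Hi @justinbieber this is fun #yes http://not.a.url.com Right???',
          'this is #fun right @justinbieber']  # second tweet is made up

token_counts = Counter()
for t in tweets:
    token_counts.update(tweet_tokenizer(t))

print(token_counts['mention_justinbieber'])
print(token_counts['fun'], token_counts['hashtag_fun'])
print(token_counts.most_common(3))
```

Note that the word `fun` and the hashtag `#fun` are counted as different tokens, which is the point of keeping the `hashtag_` prefix.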

In [107]:
# other things to know....

set([1,2,1,1,12])  # a set keeps only the unique elements: {1, 2, 12}

import glob
# https://docs.python.org/3.7/library/glob.html
glob.glob('*.json')

['test.json']
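
`glob` is handy when the data is spread over several files. A sketch that reads every `*.json` file in a directory, assuming one JSON object per line as in `test.json`; the file names `a.json`, `b.json`, and `notes.txt` are hypothetical and are created in a temporary directory.

```python
import glob
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create two small JSON-lines files to search for.
    for fname, names in [('a.json', ['joe']), ('b.json', ['jane', 'jim'])]:
        with open(os.path.join(tmpdir, fname), 'wt') as outf:
            for n in names:
                outf.write(json.dumps({'name': n}) + '\n')
    # This file does not match the pattern and should be ignored.
    with open(os.path.join(tmpdir, 'notes.txt'), 'wt') as outf:
        outf.write('not json\n')

    # glob returns matches in arbitrary order, so sort for reproducibility.
    paths = sorted(glob.glob(os.path.join(tmpdir, '*.json')))
    records = []
    for path in paths:
        with open(path) as inf:
            records.extend(json.loads(line) for line in inf)

print([os.path.basename(p) for p in paths])
print([r['name'] for r in records])
```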

To the lab!  
https://github.com/tapilab/elevate-osna-starter/blob/master/lessons/week1/README.md#day-2
