**File Processing Across Science Domains with Python**

Abstract:

I want to do a project that is half research and half coding, in which I use Python to read and parse different file types from different science domains. For instance, some disciplines store everything in text files, others in JSON, and others in specialized file types. I'd like to pick 3-5 science domains, find some free data sources, research how to handle the files and any metadata, and try to develop a program that reads in and makes sense of the file contents for each dataset. By making sense of the contents, I mean the data should be formatted either in a human-readable way or according to the domain standard.

The purpose of this project is to:

- Work on my file reading abilities in Python
- Investigate a systematic way of handling files in Python from different perspectives
- Understand how to handle metadata
- Develop skills on how to handle different types of data

Questions:
- What domains will I focus on and how many?
    - so far, three: Climate + Weather, Oceanography, Earthquakes
    - considering doing a social science as well, so I don't just have physical science domains represented
- What format is the secondary data coming in as?
    - so far, csv, xml, geojson
- What can read this file type?
    - base Python, still working on this
- What will the data look like? Dim, dtype, size, NAs, columns
    - most likely, numbers, dates, descriptions
- What do I need to do to change it?
    - Additional libraries? Changing dtypes? Re-shaping? Calculations? Identifying useful columns?
- What is the finished product? 
- What can I do with the finished product?
    - thoughts so far: I want to get the data to a state where it is ready to be visualized or analyzed, but not actually do those things. What that state looks like depends on the visualization or analysis that would follow, so I'll find a use case in each domain and shape the data to fit it.
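
The format questions above point toward a single entry point that picks a parser from the file extension. The following is a minimal sketch using only base Python. `read_any` is a hypothetical helper name, and treating `.kml` as plain XML and assuming UTF-8 text are assumptions made for illustration, not decisions taken in this project.

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def read_any(path):
    """Read a file with a parser chosen from its extension (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # newline="" is what the csv module expects for files it parses
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))  # one dict per row, values as strings
    if suffix in (".json", ".geojson"):
        with open(path, encoding="utf-8") as f:
            return json.load(f)  # nested dicts and lists
    if suffix in (".xml", ".kml"):
        return ET.parse(str(path)).getroot()  # root Element of the tree
    raise ValueError(f"no reader registered for {suffix!r}")
```

Each branch returns a different kind of object (list of dicts, nested dict, XML element), so a per-domain step would still be needed to bring them into a common shape.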

### Climate + Weather

##### Datasets:

- http://climate.weather.gc.ca/index_e.html
    - Historical data
    - Radar data
- http://climate.weather.gc.ca/prods_servs/cdn_climate_summary_e.html
    - Monthly summaries
    - csv and xml downloads
- https://www.esrl.noaa.gov/gmd/grad/surfrad/surf_check.php
    - Plot examples

##### Research
- Python resources
    - https://drclimate.wordpress.com/2016/10/04/the-weatherclimate-python-stack/
    - https://scitools.org.uk/iris/docs/latest/userguide/iris_cubes.html
    - xarray
    - https://arm-doe.github.io/pyart/

In [183]:
!pip install fastkml

Collecting fastkml
  Downloading https://files.pythonhosted.org/packages/55/10/981bae93dfd4a43cd3a4d7702789d195484ddce142842fb505bd0919ef37/fastkml-0.11.tar.gz (66kB)
Collecting pygeoif (from fastkml)
  Downloading https://files.pythonhosted.org/packages/f0/a7/fc5df91be602a66aaae21213e6eb9b9b8039c8074b6515c570b5110b9108/pygeoif-0.7.tar.gz
Building wheels for collected packages: fastkml, pygeoif
  Running setup.py bdist_wheel for fastkml ... done
  Stored in directory: /Users/swalkow/Library/Caches/pip/wheels/55/01/ea/6191eb73e0894743d02b33a2b1a570e85242844d810804fbf2
  Running setup.py bdist_wheel for pygeoif ... done
  Stored in directory: /Users/swalkow/Library/Caches/pip/wheels/60/6e/a7/3d3eef59ac84a86663d0f5c5a92091f5056e9aeb6588c4de34
Successfully built fastkml pygeoif
Installing collected packages: pygeoif, fastkml
Successfully installed fastkml-0.11 pygeoif-0.7

In [194]:
import fastkml as kml

In [198]:
with open('Icethickness.kml', 'rt') as myfile:
    doc=myfile.read()

In [201]:
print(doc)

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
  <name>Ice thickness</name>
  <description><![CDATA[]]></description>
  <Style id="style2">
    <IconStyle>
      <Icon>
        <href>http://maps.gstatic.com/intl/en_ca/mapfiles/ms/micons/blue-dot.png</href>
      </Icon>
    </IconStyle>
  </Style>
  <Style id="style8">
    <IconStyle>
      <Icon>
        <href>http://maps.gstatic.com/intl/en_ca/mapfiles/ms/micons/blue-dot.png</href>
      </Icon>
    </IconStyle>
  </Style>
  <Style id="style3">
    <IconStyle>
      <Icon>
        <href>http://maps.gstatic.com/intl/en_ca/mapfiles/ms/micons/blue-dot.png</href>
      </Icon>
    </IconStyle>
  </Style>
  <Style id="style6">
    <IconStyle>
      <Icon>
        <href>http://maps.gstatic.com/intl/en_ca/mapfiles/ms/micons/blue-dot.png</href>
      </Icon>
    </IconStyle>
  </Style>
  <Style id="style4">
    <IconStyle>
      <Icon>
        <href>http://maps.gstatic.com/intl/en_ca/mapfiles/
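
Since KML is XML, the printed document can also be walked with the standard library instead of `fastkml`. This is a minimal sketch under that assumption. `sample_kml` is a shortened stand-in built from the lines printed above (not the real `Icethickness.kml`), and `summarize_kml` is a hypothetical helper. The printout is cut off before any further elements, so the sketch only touches the document name and the `Style` ids that are visible.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the printed file contents (assumption, not the real file).
sample_kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
  <name>Ice thickness</name>
  <Style id="style2">
    <IconStyle><Icon><href>http://maps.gstatic.com/intl/en_ca/mapfiles/ms/micons/blue-dot.png</href></Icon></IconStyle>
  </Style>
  <Style id="style8">
    <IconStyle><Icon><href>http://maps.gstatic.com/intl/en_ca/mapfiles/ms/micons/blue-dot.png</href></Icon></IconStyle>
  </Style>
</Document>
</kml>"""

# Every tag lives in the namespace declared on the <kml> root element.
NS = {"kml": "http://earth.google.com/kml/2.2"}


def summarize_kml(text):
    """Return the document name and the list of Style ids (hypothetical helper)."""
    root = ET.fromstring(text.encode("utf-8"))
    doc_name = root.findtext("kml:Document/kml:name", namespaces=NS)
    style_ids = [style.get("id") for style in root.iterfind(".//kml:Style", NS)]
    return doc_name, style_ids


print(summarize_kml(sample_kml))
```

Without the namespace mapping, searches such as `root.find("Document")` return nothing, which is a common stumbling block when parsing KML as generic XML.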

In [179]:
import pandas as pd

In [180]:
ice_data = pd.read_excel('Ice_thickness.xls')

In [182]:
ice_data

Unnamed: 0,Table/Tableau 1: Explanation of Attributes/Explication des attributs,Unnamed: 1,Unnamed: 2
0,Column/\nColonne,Title/Titre,Description
1,A,Station ID/ID de station,Station ID/ID de station
2,B,Station Name/\nNom de station,Station Name/Nom de station
3,C,Relevant Date/\nDate pertinente,The date when the ice measurement is taken / D...
4,D,Ice Thickness/\nÉpaisseur de la glace,Measured ice thickness to the nearest whole ce...
5,E,Snow Depth/Profondeur de la neige,Average snow depth to the nearest whole centim...
6,F,Surface Code/Code de surface,Surface features at the measurement site and s...
7,G,Water Feature/Caractéristiques d'eau,The presence and orientation of cracks and lea...
8,H,Method of Observation/\nMéthode d'observation,The method used to measure or estiamte the ice...
9,,,


### Oceanography

##### Datasets:

- https://planetos.com/
    - https://data.planetos.com/datasets/noaa_blended_sea_winds_clim_global
    - gridded 

In [158]:
import json
import csv

In [159]:
with open("Oceanography_JSON.json") as f:
    oj = json.load(f)

In [160]:
print(oj.keys())

dict_keys(['stats', 'entries'])


In [161]:
print(oj.values())

dict_values([{'timeMin': '2000-01-15T00:00:00', 'count': 10, 'offset': 0, 'nextOffset': 10, 'timeMax': '2000-12-15T00:00:00'}, [{'context': 'time_zlev_lat_lon', 'axes': {'time': '2000-01-15T00:00:00', 'z': 10.0, 'latitude': 49.5, 'longitude': -50.49999999999997}, 'data': {'u': 5.937192916870117, 'v': 0.11202297359704971, 'w': 11.47663688659668}}, {'context': 'time_zlev_lat_lon', 'axes': {'time': '2000-02-15T00:00:00', 'z': 10.0, 'latitude': 49.5, 'longitude': -50.49999999999997}, 'data': {'u': 4.743940353393555, 'v': 0.5978343486785889, 'w': 10.427130699157715}}, {'context': 'time_zlev_lat_lon', 'axes': {'time': '2000-03-16T00:00:00', 'z': 10.0, 'latitude': 49.5, 'longitude': -50.49999999999997}, 'data': {'u': 3.575047731399536, 'v': -0.9129931330680847, 'w': 9.554756164550781}}, {'context': 'time_zlev_lat_lon', 'axes': {'time': '2000-04-15T00:00:00', 'z': 10.0, 'latitude': 49.5, 'longitude': -50.49999999999997}, 'data': {'u': 1.7229785919189453, 'v': -0.20937982201576233, 'w': 8.13112

In [172]:
print(oj['entries'][0])

{'context': 'time_zlev_lat_lon', 'axes': {'time': '2000-01-15T00:00:00', 'z': 10.0, 'latitude': 49.5, 'longitude': -50.49999999999997}, 'data': {'u': 5.937192916870117, 'v': 0.11202297359704971, 'w': 11.47663688659668}}


In [177]:
print(oj['stats'])

{'timeMin': '2000-01-15T00:00:00', 'count': 10, 'offset': 0, 'nextOffset': 10, 'timeMax': '2000-12-15T00:00:00'}
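
Each entry nests its coordinates under `axes` and its measurements under `data`, so one way to make the JSON tabular is to flatten every entry into a single row. The sketch below assumes the structure printed above. `sample` reuses the `stats` block and first entry from that output, `flatten_entries` is a hypothetical helper, and the `axis:`/`data:` prefixes are chosen only to mirror the column names of the CSV export of the same dataset.

```python
import json

# Stand-in for Oceanography_JSON.json: the stats block and first entry printed above.
sample = json.loads("""
{"stats": {"timeMin": "2000-01-15T00:00:00", "count": 10, "offset": 0,
           "nextOffset": 10, "timeMax": "2000-12-15T00:00:00"},
 "entries": [{"context": "time_zlev_lat_lon",
              "axes": {"time": "2000-01-15T00:00:00", "z": 10.0,
                       "latitude": 49.5, "longitude": -50.49999999999997},
              "data": {"u": 5.937192916870117, "v": 0.11202297359704971,
                       "w": 11.47663688659668}}]}
""")


def flatten_entries(doc):
    """Turn each entry into one flat dict with prefixed keys (hypothetical helper)."""
    rows = []
    for entry in doc["entries"]:
        row = {f"axis:{key}": value for key, value in entry["axes"].items()}
        row.update({f"data:{key}": value for key, value in entry["data"].items()})
        rows.append(row)
    return rows


rows = flatten_entries(sample)
print(sorted(rows[0]))
```

The `stats` block stays separate as metadata. Its `count`, `offset`, and `nextOffset` fields look like paging information, though that reading is an assumption.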


In [162]:
import pandas as pd

In [163]:
oj_csv = pd.read_csv('Oceanography_csv.txt')

In [166]:
oj_csv

Unnamed: 0,axis:latitude,axis:longitude,axis:time,axis:z,data:u,data:v,data:w
0,49.5,-50.5,2000-01-15T00:00:00,10.0,5.937193,0.112023,11.476637
1,49.5,-50.5,2000-02-15T00:00:00,10.0,4.74394,0.597834,10.427131
2,49.5,-50.5,2000-03-16T00:00:00,10.0,3.575048,-0.912993,9.554756
3,49.5,-50.5,2000-04-15T00:00:00,10.0,1.722979,-0.20938,8.131122
4,49.5,-50.5,2000-05-15T00:00:00,10.0,1.266216,0.241412,6.245647
5,49.5,-50.5,2000-06-15T00:00:00,10.0,2.146084,0.942541,5.742968
6,49.5,-50.5,2000-07-15T00:00:00,10.0,1.829829,2.622753,5.481566
7,49.5,-50.5,2000-08-15T00:00:00,10.0,2.097444,1.668255,5.771543
8,49.5,-50.5,2000-09-15T00:00:00,10.0,2.657856,0.543245,6.975606
9,49.5,-50.5,2000-10-15T00:00:00,10.0,2.837727,0.14493,8.257331
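
The same table can be read with base Python if the values are converted by hand, because the `csv` module returns every field as a string. This is a rough sketch. `sample_csv` copies the first two rows shown above, and it assumes the file's header row starts at `axis:latitude` (the `Unnamed: 0` column in the display is taken to be the DataFrame index, not a column in the file). `typed_rows` is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

# First two data rows from the table above; the header layout is an assumption.
sample_csv = """axis:latitude,axis:longitude,axis:time,axis:z,data:u,data:v,data:w
49.5,-50.5,2000-01-15T00:00:00,10.0,5.937193,0.112023,11.476637
49.5,-50.5,2000-02-15T00:00:00,10.0,4.74394,0.597834,10.427131
"""


def typed_rows(handle):
    """Yield one dict per row with timestamps parsed and numbers as floats."""
    for row in csv.DictReader(handle):
        yield {
            name: datetime.fromisoformat(value) if name == "axis:time" else float(value)
            for name, value in row.items()
        }


rows = list(typed_rows(io.StringIO(sample_csv)))
print(rows[0]["axis:time"], rows[0]["data:u"])
```

With pandas, passing `parse_dates=['axis:time']` to `pd.read_csv` is the closest equivalent for the date column.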


### Earthquakes

##### Datasets:

- https://earthquake.usgs.gov/earthquakes/search/
    - csv, xml or geojson
    - timeseries
    - https://stackoverflow.com/questions/42753745/how-can-i-parse-geojson-with-python

In [12]:
! pip install geojson

Collecting geojson
  Downloading https://files.pythonhosted.org/packages/f1/34/bc3a65faabce27a7faa755ab08d811207a4fc438f77ef09c229fc022d778/geojson-2.4.1-py2.py3-none-any.whl
Installing collected packages: geojson
Successfully installed geojson-2.4.1
You are using pip version 18.0, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [28]:
import geojson

In [31]:
with open("Earthquake_GeoJson.json") as f:
    gj = geojson.load(f)

In [32]:
print(gj.keys())

dict_keys(['type', 'metadata', 'bbox', 'features'])


In [132]:
first_feature = gj['features'][0]
print(first_feature)
# 'features' has the most data
# what is useful to extract and transform?
# coordinates, mag, magType, place, sources, time

{"geometry": {"coordinates": [-175.7399, -22.6763, 10], "type": "Point"}, "id": "us700035gt", "properties": {"alert": null, "cdi": null, "code": "700035gt", "detail": "https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us700035gt&format=geojson", "dmin": 8.624, "felt": null, "gap": 116, "ids": ",us700035gt,", "mag": 4.9, "magType": "mb", "mmi": null, "net": "us", "nst": null, "place": "169km SSW of `Ohonua, Tonga", "rms": 0.74, "sig": 369, "sources": ",us,", "status": "reviewed", "time": 1555174874150, "title": "M 4.9 - 169km SSW of `Ohonua, Tonga", "tsunami": 0, "type": "earthquake", "types": ",geoserve,origin,phase-data,", "tz": -720, "updated": 1555177471040, "url": "https://earthquake.usgs.gov/earthquakes/eventpage/us700035gt"}, "type": "Feature"}
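
The fields listed in the comments above can be pulled out of one feature and lightly transformed. The following is a minimal sketch. `feature` is the printed record trimmed to those fields, `summarize_quake` is a hypothetical helper, and it assumes the USGS conventions that coordinates are ordered longitude, latitude, depth in km and that `time` is milliseconds since the Unix epoch.

```python
from datetime import datetime, timezone

# First feature as printed above, trimmed to the fields of interest.
feature = {
    "geometry": {"coordinates": [-175.7399, -22.6763, 10], "type": "Point"},
    "id": "us700035gt",
    "properties": {
        "mag": 4.9,
        "magType": "mb",
        "place": "169km SSW of `Ohonua, Tonga",
        "sources": ",us,",
        "time": 1555174874150,
    },
    "type": "Feature",
}


def summarize_quake(feat):
    """Flatten one GeoJSON feature into a plain dict (hypothetical helper)."""
    lon, lat, depth_km = feat["geometry"]["coordinates"]  # assumed order
    props = feat["properties"]
    return {
        "id": feat["id"],
        "longitude": lon,
        "latitude": lat,
        "depth_km": depth_km,
        "magnitude": props["mag"],
        "mag_type": props["magType"],
        "place": props["place"],
        # ",us," becomes ["us"]; empty pieces around the commas are dropped
        "sources": [s for s in props["sources"].split(",") if s],
        # assumed to be milliseconds since the epoch, converted to aware UTC time
        "time_utc": datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc),
    }


quake = summarize_quake(feature)
print(quake["time_utc"].isoformat(), quake["magnitude"], quake["place"])
```

Applying the same function to every item in `gj['features']` would give a list of flat records that is ready to load into a table.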


In [131]:
print(gj["type"])
print()
# metadata could be useful
print(gj['metadata'])
print()
# could be useful for data visualization
print(gj['bbox'])

FeatureCollection

{'generated': 1555179250000, 'url': 'https://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=2019-03-14%2000:00:00&endtime=2019-04-13%2023:59:59&minmagnitude=4.5&orderby=time', 'title': 'USGS Earthquakes', 'status': 200, 'api': '1.8.1', 'count': 468}

[-179.8822, -64.6529, 2.8, 179.3101, 85.1725, 616.53]
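
The metadata and bounding box can be turned into a short description of the whole collection. This is a sketch only. The values are copied from the output above, `describe_collection` is a hypothetical helper, and it assumes the six `bbox` numbers follow the GeoJSON order (minimum longitude, latitude, and third axis, then the maximums), with the third axis being depth in km for this feed.

```python
from datetime import datetime, timezone

# Values copied from the printed metadata and bbox above (url omitted).
metadata = {"generated": 1555179250000, "title": "USGS Earthquakes",
            "status": 200, "api": "1.8.1", "count": 468}
bbox = [-179.8822, -64.6529, 2.8, 179.3101, 85.1725, 616.53]


def describe_collection(metadata, bbox):
    """Summarise when the file was generated and what extent it covers."""
    min_lon, min_lat, min_depth, max_lon, max_lat, max_depth = bbox  # assumed order
    return {
        # 'generated' is assumed to be milliseconds since the Unix epoch
        "generated_utc": datetime.fromtimestamp(metadata["generated"] / 1000,
                                                tz=timezone.utc),
        "event_count": metadata["count"],
        "lon_range": (min_lon, max_lon),
        "lat_range": (min_lat, max_lat),
        "depth_range_km": (min_depth, max_depth),
    }


summary = describe_collection(metadata, bbox)
print(summary)
```

Comparing `metadata['count']` with `len(gj['features'])` would be a cheap sanity check that the download is complete.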


##### Research
- https://earthquake.usgs.gov/research/
- https://github.com/NCAR/chords/wiki/JSON-vs-GeoJSON
    - it doesn't seem like there are any technical differences between JSON and GeoJSON: a GeoJSON file is a valid JSON file
    - GeoJSON seems to be a streamlined way of storing geographic data, meaning any data that is bound to coordinates in space
    - so I can break this down like a JSON!
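
To illustrate the point that GeoJSON can be handled like any other JSON, the sketch below parses a small GeoJSON string with the standard `json` module rather than the `geojson` package. `text` is a made-up one-feature stand-in with the same top-level keys as the earthquake file (its `bbox` and `metadata` values are placeholders), and `magnitudes_by_id` is a hypothetical helper.

```python
import json

# Made-up one-feature stand-in with the same top-level keys as the earthquake file.
text = """
{"type": "FeatureCollection",
 "metadata": {"title": "USGS Earthquakes", "count": 1},
 "bbox": [-175.7399, -22.6763, 10, -175.7399, -22.6763, 10],
 "features": [
   {"type": "Feature", "id": "us700035gt",
    "geometry": {"type": "Point", "coordinates": [-175.7399, -22.6763, 10]},
    "properties": {"mag": 4.9, "place": "169km SSW of `Ohonua, Tonga"}}
 ]}
"""

collection = json.loads(text)  # plain dicts and lists, no GeoJSON-specific types


def magnitudes_by_id(collection):
    """Map each feature id to its magnitude (hypothetical helper)."""
    return {feat["id"]: feat["properties"]["mag"] for feat in collection["features"]}


print(sorted(collection.keys()))
print(magnitudes_by_id(collection))
```

The `geojson` package builds on the same parsing and returns dict-like objects, so the key-based access used in this notebook works with either loader.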

### Social Science

##### Datasets:
- https://cooldatasets.com/

In [94]:
import csv
import pandas as pd

In [129]:
# try twitter dataset- if can't be fixed find a new one
# Find issue in the rows - find a way to exclude difficult rows
# row 70 has an issue
twitter_pd = pd.read_csv('twitter-hate-speech-classifier-DFE-a845520.csv', sep=',', nrows=69, header = 0)

In [107]:
twitter_pd.head(5)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,does_this_tweet_contain_hate_speech,does_this_tweet_contain_hate_speech:confidence,_created_at,orig__golden,orig__last_judgment_at,orig__trusted_judgments,orig__unit_id,orig__unit_state,_updated_at,orig_does_this_tweet_contain_hate_speech,does_this_tweet_contain_hate_speech_gold,does_this_tweet_contain_hate_speech_gold_reason,does_this_tweet_contain_hate_speechconfidence,tweet_id,tweet_text
0,853718217,True,golden,86,,The tweet uses offensive language but not hate...,0.6013,,True,,0,615561535,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,1666196150,Warning: penny boards will make you a faggot
1,853718218,True,golden,92,,The tweet contains hate speech,0.7227,,True,,0,615561723,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,429512078,Fuck dykes
2,853718219,True,golden,86,,The tweet contains hate speech,0.5229,,True,,0,615562039,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,395623778,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandon...
3,853718220,True,golden,98,,The tweet contains hate speech,0.5184,,True,,0,615562068,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,497514685,"""@jayswaggkillah: ""@JacklynAnnn: @jayswaggkill..."
4,853718221,True,golden,88,,The tweet uses offensive language but not hate...,0.5185,,True,,0,615562488,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,588923553,@Zhugstubble You heard me bitch but any way I'...


In [93]:
with open('twitter-hate-speech-classifier-DFE-a845520.csv', 'r') as twitter_data:
    csv_reader = csv.reader(twitter_data, delimiter=',')
    line_count = 0
    print(csv_reader)
    for row in csv_reader:
        print(row)

<_csv.reader object at 0x1101a03c8>
['_unit_id', '_golden', '_unit_state', '_trusted_judgments', '_last_judgment_at', 'does_this_tweet_contain_hate_speech', 'does_this_tweet_contain_hate_speech:confidence', '_created_at', 'orig__golden', 'orig__last_judgment_at', 'orig__trusted_judgments', 'orig__unit_id', 'orig__unit_state', '_updated_at', 'orig_does_this_tweet_contain_hate_speech', 'does_this_tweet_contain_hate_speech_gold', 'does_this_tweet_contain_hate_speech_gold_reason', 'does_this_tweet_contain_hate_speechconfidence', 'tweet_id', 'tweet_text']
['853718218', 'TRUE', 'golden', '92', '', 'The tweet contains hate speech', '0.7227', '', 'TRUE', '', '0', '615561723', 'golden', '', 'The tweet contains hate speech', 'The tweet contains hate speech\nThe tweet uses offensive language but not hate speech', '', '1', '429512078', 'Fuck dykes']
['853718219', 'TRUE', 'golden', '86', '', 'The tweet contains hate speech', '0.5229', '', 'TRUE', '', '0', '615562039', 'golden', '', 'The tweet conta

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 5123: invalid start byte
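
The `UnicodeDecodeError` means the file contains a byte (`0x89`) that is not valid UTF-8, so the reader fails partway through. One way around this is to try a few candidate encodings in order. This is a hedged sketch: the file's real encoding is not known, the candidate list is a guess, and `read_csv_rows` is a hypothetical helper.

```python
import csv


def read_csv_rows(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (encoding_used, rows), trying each candidate encoding in turn."""
    for enc in encodings:
        try:
            # newline="" lets the csv module handle line breaks inside quoted fields
            with open(path, newline="", encoding=enc) as f:
                return enc, list(csv.reader(f))
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file; try the next one
    raise ValueError("none of the candidate encodings could decode the file")
```

A successful decode does not prove the guess is right: `latin-1` accepts every byte, so accented or special characters may still come out wrong. `pd.read_csv` takes an `encoding` argument as well, which may remove the need for the `nrows=69` workaround if the problem row is an encoding issue.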

In [75]:
# Assumption: the file is not UTF-8 (see the UnicodeDecodeError above); latin-1 accepts any byte, though some characters may decode incorrectly
twitter_open = open('twitter-hate-speech-classifier-DFE-a845520.csv', 'r', encoding='latin-1')

In [78]:
test = twitter_open.read()