**File Processing Across Science Domains with Python**

Abstract:

I want to do a project that is half research and half coding, in which I use Python to read in and parse different file types from different science domains. For instance, some disciplines store everything in text files, others in JSON, and others in specialized file types. I'd like to pick 3-5 science domains, find some free data sources, research how to handle the files and any metadata, and try to develop a program that reads in and makes sense of the file contents for each dataset. By making sense of the contents, I mean the data should be formatted either in a human-readable way or according to the domain standard.

The purpose of this project is to:

- Work on my file reading abilities in Python
- Investigate a systematic way of handling files in Python from different perspectives
- Understand how to handle metadata
- Develop skills in handling different types of data

Questions:
- What domains will I focus on and how many?
    - so far, three: Climate + Weather, Oceanography, Earthquakes
    - considering doing a social science as well, so I don't just have physical science domains represented
- What format is the secondary data coming in as?
    - so far, csv, xml, geojson
- What can read this file type?
    - base Python; still working on this
- What will the data look like? Dim, dtype, size, NAs, columns
    - most likely, numbers, dates, descriptions
- What do I need to do to change it?
    - Additional libraries? Changing dtypes? Re-shaping? Calculations? Identifying useful columns?
- What is the finished product? 
- What can I do with the finished product?
    - thoughts so far: I want to get the data to a state where it is ready to be visualized or analyzed, but not actually do those things. This really depends on what visualization or analysis is going to be done, so I'll find a use case and try to get the data to fit that specific domain use case.
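
One possible way to approach the "systematic" part with base Python is to keep a small table that maps file extensions to parser functions. This is only a sketch under my own assumptions: the `READERS` table, the function names, and the tiny sample strings are all hypothetical and not taken from any of the datasets.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def read_csv_text(text):
    """Return a list of dicts, one per row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_text(text):
    """Return the decoded JSON document (GeoJSON is parsed the same way)."""
    return json.loads(text)


def read_xml_text(text):
    """Return the root element of the parsed XML document."""
    return ET.fromstring(text)


# Hypothetical registry: file extension -> parser function.
READERS = {
    ".csv": read_csv_text,
    ".json": read_json_text,
    ".geojson": read_json_text,
    ".xml": read_xml_text,
}


def parse_by_extension(filename, text):
    """Pick a parser from the file extension and apply it to the text."""
    suffix = Path(filename).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader registered for {suffix!r}")
    return READERS[suffix](text)


# Made-up sample inputs, just to exercise each branch.
rows = parse_by_extension("sample.csv", "station,temp_c\nA,1.5\nB,-2.0\n")
doc = parse_by_extension("sample.geojson", '{"type": "FeatureCollection", "features": []}')
root = parse_by_extension("sample.xml", "<summary><month>1</month></summary>")
print(rows[0], doc["type"], root.find("month").text)
```

Each domain could then add its own post-processing step (dtype conversion, reshaping) on top of whatever the matching reader returns.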

### Climate + Weather

##### Datasets:

- http://climate.weather.gc.ca/index_e.html
    - Historical data
    - Radar data
- http://climate.weather.gc.ca/prods_servs/cdn_climate_summary_e.html
    - Monthly summaries
    - csv and xml downloads
- https://www.esrl.noaa.gov/gmd/grad/surfrad/surf_check.php
    - Plot examples

##### Research
- Python resources
    - https://drclimate.wordpress.com/2016/10/04/the-weatherclimate-python-stack/
    - https://scitools.org.uk/iris/docs/latest/userguide/iris_cubes.html
    - xarray
    - https://arm-doe.github.io/pyart/

### Oceanography

##### Datasets:

- https://planetos.com/
    - https://data.planetos.com/datasets/noaa_blended_sea_winds_clim_global
    - gridded 

In [135]:
import requests

In [143]:
# trying csv format
dt = requests.get('https://api.planetos.com/v1/datasets/noaa_blended_sea_winds_clim_global/point?origin=dataset-details&lat=49.5&apikey=5c7e35edb8d64126a8794abe73b4d1b8&lon=-50.5&count=10&csv=true&count=50')

In [141]:
# trying json format
djson = requests.get('https://api.planetos.com/v1/datasets/noaa_blended_sea_winds_clim_global/point?origin=dataset-details&lat=49.5&apikey=5c7e35edb8d64126a8794abe73b4d1b8&lon=-50.5&count=10')

In [142]:
print(djson.text)

{
  "stats": {
    "timeMin": "2000-01-15T00:00:00",
    "count": 10,
    "offset": 0,
    "nextOffset": 10,
    "timeMax": "2000-12-15T00:00:00"
  },
  "entries": [{
    "context": "time_zlev_lat_lon",
    "axes": {
      "time": "2000-01-15T00:00:00",
      "z": 10.0,
      "latitude": 49.5,
      "longitude": -50.49999999999997
    },
    "data": {
      "u": 5.937192916870117,
      "v": 0.11202297359704971,
      "w": 11.47663688659668
    }
  }, {
    "context": "time_zlev_lat_lon",
    "axes": {
      "time": "2000-02-15T00:00:00",
      "z": 10.0,
      "latitude": 49.5,
      "longitude": -50.49999999999997
    },
    "data": {
      "u": 4.743940353393555,
      "v": 0.5978343486785889,
      "w": 10.427130699157715
    }
  }, {
    "context": "time_zlev_lat_lon",
    "axes": {
      "time": "2000-03-16T00:00:00",
      "z": 10.0,
      "latitude": 49.5,
      "longitude": -50.49999999999997
    },
    "data": {
      "u": 3.575047731399536,
      "v": -0.9129931330680847,
  

In [140]:
print(dt.text)

axis:latitude,axis:longitude,axis:time,axis:z,data:u,data:v,data:w
49.5,-50.49999999999997,"2000-01-15T00:00:00",10.0,5.937192916870117,0.11202297359704971,11.47663688659668
49.5,-50.49999999999997,"2000-02-15T00:00:00",10.0,4.743940353393555,0.5978343486785889,10.427130699157715
49.5,-50.49999999999997,"2000-03-16T00:00:00",10.0,3.575047731399536,-0.9129931330680847,9.554756164550781
49.5,-50.49999999999997,"2000-04-15T00:00:00",10.0,1.7229785919189453,-0.20937982201576233,8.131121635437012
49.5,-50.49999999999997,"2000-05-15T00:00:00",10.0,1.2662155628204346,0.24141182005405426,6.245646953582764
49.5,-50.49999999999997,"2000-06-15T00:00:00",10.0,2.1460843086242676,0.9425407648086548,5.74296760559082
49.5,-50.49999999999997,"2000-07-15T00:00:00",10.0,1.8298293352127075,2.6227529048919678,5.481565952301025
49.5,-50.49999999999997,"2000-08-15T00:00:00",10.0,2.0974442958831787,1.6682549715042114,5.771543025970459
49.5,-50.49999999999997,"2000-09-15T00:00:00",10.0,2.6578564643859863,0.543
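
The JSON and CSV responses above carry the same values in two layouts, so one way to check my parsing is to turn both into the same list of flat records. This is a rough sketch using only base Python; the sample strings are copied from the first two entries of each output, and the helper names (`convert`, `records_from_json`, `records_from_csv`) are my own.

```python
import csv
import io
import json
from datetime import datetime

# First two entries of the JSON response printed above.
json_text = '''{"entries": [
  {"context": "time_zlev_lat_lon",
   "axes": {"time": "2000-01-15T00:00:00", "z": 10.0,
            "latitude": 49.5, "longitude": -50.49999999999997},
   "data": {"u": 5.937192916870117, "v": 0.11202297359704971, "w": 11.47663688659668}},
  {"context": "time_zlev_lat_lon",
   "axes": {"time": "2000-02-15T00:00:00", "z": 10.0,
            "latitude": 49.5, "longitude": -50.49999999999997},
   "data": {"u": 4.743940353393555, "v": 0.5978343486785889, "w": 10.427130699157715}}
]}'''

# Header and first two rows of the CSV response printed above.
csv_text = '''axis:latitude,axis:longitude,axis:time,axis:z,data:u,data:v,data:w
49.5,-50.49999999999997,"2000-01-15T00:00:00",10.0,5.937192916870117,0.11202297359704971,11.47663688659668
49.5,-50.49999999999997,"2000-02-15T00:00:00",10.0,4.743940353393555,0.5978343486785889,10.427130699157715
'''


def convert(name, value):
    """Turn the time field into a datetime and everything else into a float."""
    if name == "time":
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return float(value)


def records_from_json(text):
    rows = []
    for entry in json.loads(text)["entries"]:
        merged = {**entry["axes"], **entry["data"]}  # one flat dict per entry
        rows.append({name: convert(name, value) for name, value in merged.items()})
    return rows


def records_from_csv(text):
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in raw.items():
            name = key.split(":", 1)[1]  # "axis:latitude" -> "latitude"
            row[name] = convert(name, value)
        rows.append(row)
    return rows


json_records = records_from_json(json_text)
csv_records = records_from_csv(csv_text)
print(json_records == csv_records)
```

With the live responses, `djson.text` and `dt.text` could be passed in place of the sample strings, assuming the API keeps returning the layout shown above.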

### Earthquakes

##### Datasets:

- https://earthquake.usgs.gov/earthquakes/search/
    - csv, xml or geojson
    - timeseries
    - https://stackoverflow.com/questions/42753745/how-can-i-parse-geojson-with-python

In [12]:
! pip install geojson

Collecting geojson
  Downloading https://files.pythonhosted.org/packages/f1/34/bc3a65faabce27a7faa755ab08d811207a4fc438f77ef09c229fc022d778/geojson-2.4.1-py2.py3-none-any.whl
Installing collected packages: geojson
Successfully installed geojson-2.4.1
You are using pip version 18.0, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [28]:
import geojson

In [31]:
with open("Earthquake_GeoJson.json") as f:
    gj = geojson.load(f)

In [32]:
print(gj.keys())

dict_keys(['type', 'metadata', 'bbox', 'features'])


In [132]:
features = gj['features'][0]
print(features)
# has the most data
# what is useful to extract and transform?
# coordinates, magnitude, magtype, place, sources, time

{"geometry": {"coordinates": [-175.7399, -22.6763, 10], "type": "Point"}, "id": "us700035gt", "properties": {"alert": null, "cdi": null, "code": "700035gt", "detail": "https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us700035gt&format=geojson", "dmin": 8.624, "felt": null, "gap": 116, "ids": ",us700035gt,", "mag": 4.9, "magType": "mb", "mmi": null, "net": "us", "nst": null, "place": "169km SSW of `Ohonua, Tonga", "rms": 0.74, "sig": 369, "sources": ",us,", "status": "reviewed", "time": 1555174874150, "title": "M 4.9 - 169km SSW of `Ohonua, Tonga", "tsunami": 0, "type": "earthquake", "types": ",geoserve,origin,phase-data,", "tz": -720, "updated": 1555177471040, "url": "https://earthquake.usgs.gov/earthquakes/eventpage/us700035gt"}, "type": "Feature"}
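
A first pass at pulling out the fields listed in the comments above (coordinates, magnitude, magtype, place, sources, time) might look like this. It is a sketch under two assumptions that should be checked against the USGS documentation: that `coordinates` is ordered longitude, latitude, depth, and that `time` is milliseconds since the Unix epoch. The `flatten_feature` helper and the output field names are my own.

```python
from datetime import datetime, timedelta, timezone

# Trimmed copy of the feature printed above (only the fields used here).
feature = {
    "geometry": {"coordinates": [-175.7399, -22.6763, 10], "type": "Point"},
    "id": "us700035gt",
    "properties": {
        "mag": 4.9,
        "magType": "mb",
        "place": "169km SSW of `Ohonua, Tonga",
        "sources": ",us,",
        "time": 1555174874150,
    },
    "type": "Feature",
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def flatten_feature(feat):
    """Return one flat dict per earthquake feature."""
    # Assumed order: longitude, latitude, depth.
    lon, lat, depth = feat["geometry"]["coordinates"]
    props = feat["properties"]
    return {
        "id": feat["id"],
        "longitude": lon,
        "latitude": lat,
        "depth": depth,
        "magnitude": props["mag"],
        "mag_type": props["magType"],
        "place": props["place"],
        # ",us," -> ["us"]; drop the empty strings around the commas
        "sources": [s for s in props["sources"].split(",") if s],
        # Assumed unit: milliseconds since 1970-01-01 UTC.
        "time": EPOCH + timedelta(milliseconds=props["time"]),
    }


quake = flatten_feature(feature)
print(quake["time"].isoformat(), quake["magnitude"], quake["place"])
```

Applied to the whole file, `[flatten_feature(f) for f in gj['features']]` would give one record per event, as long as every feature has these keys.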


In [131]:
print(gj["type"])
print()
# metadata could be useful
print(gj['metadata'])
print()
# could be useful for data visualization
print(gj['bbox'])

FeatureCollection

{'generated': 1555179250000, 'url': 'https://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=2019-03-14%2000:00:00&endtime=2019-04-13%2023:59:59&minmagnitude=4.5&orderby=time', 'title': 'USGS Earthquakes', 'status': 200, 'api': '1.8.1', 'count': 468}

[-179.8822, -64.6529, 2.8, 179.3101, 85.1725, 616.53]
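
To make the bounding box and the `generated` timestamp easier to use later, they could be unpacked into named values. This is a sketch that assumes the six-number `bbox` follows the GeoJSON convention of all minimums followed by all maximums, with the third axis being depth in this feed, and that `generated` is in milliseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the output above.
metadata = {"generated": 1555179250000, "count": 468}
bbox = [-179.8822, -64.6529, 2.8, 179.3101, 85.1725, 616.53]

# Assumed layout: [min lon, min lat, min depth, max lon, max lat, max depth].
min_lon, min_lat, min_depth, max_lon, max_lat, max_depth = bbox
extent = {
    "longitude": (min_lon, max_lon),
    "latitude": (min_lat, max_lat),
    "depth": (min_depth, max_depth),
}

# Assumed unit: milliseconds since 1970-01-01 UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
generated = epoch + timedelta(milliseconds=metadata["generated"])

print(extent)
print(generated.isoformat())
```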


##### Research
- https://earthquake.usgs.gov/research/
- https://github.com/NCAR/chords/wiki/JSON-vs-GeoJSON
    - it doesn't seem like there are any technical differences between JSON and GeoJSON as far as parsing goes: a GeoJSON file is an ordinary JSON document
    - GeoJSON seems to be a streamlined way of storing geographic data, meaning any data that is bound by coordinates in space
    - so I can break this down like JSON!
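
To check that idea, here is a small sketch that reads a GeoJSON document with the standard `json` module instead of the `geojson` package. The document is a cut-down stand-in with a single feature, built from values in the earthquake output above; it is not the real file.

```python
import json

# Cut-down stand-in for the earthquake file: one feature, a few properties.
geojson_text = '''
{
  "type": "FeatureCollection",
  "metadata": {"title": "USGS Earthquakes"},
  "features": [
    {
      "type": "Feature",
      "id": "us700035gt",
      "geometry": {"type": "Point", "coordinates": [-175.7399, -22.6763, 10]},
      "properties": {"mag": 4.9, "magType": "mb"}
    }
  ]
}
'''

doc = json.loads(geojson_text)

# The result is plain dicts and lists, so normal indexing and loops work.
print(sorted(doc.keys()))
magnitudes = [feature["properties"]["mag"] for feature in doc["features"]]
points = [feature["geometry"]["coordinates"] for feature in doc["features"]]
print(magnitudes, points)
```

The `geojson` package may still be worth keeping for validation or for building geometry objects, but for reading values out, plain `json` looks sufficient.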

### Social Science

##### Datasets:
- https://cooldatasets.com/

In [94]:
import csv
import pandas as pd

In [129]:
# try twitter dataset- if can't be fixed find a new one
# Find issue in the rows - find a way to exclude difficult rows
# row 70 has an issue
twitter_pd = pd.read_csv('twitter-hate-speech-classifier-DFE-a845520.csv', sep=',', nrows=69, header = 0)

In [107]:
twitter_pd.head(5)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,does_this_tweet_contain_hate_speech,does_this_tweet_contain_hate_speech:confidence,_created_at,orig__golden,orig__last_judgment_at,orig__trusted_judgments,orig__unit_id,orig__unit_state,_updated_at,orig_does_this_tweet_contain_hate_speech,does_this_tweet_contain_hate_speech_gold,does_this_tweet_contain_hate_speech_gold_reason,does_this_tweet_contain_hate_speechconfidence,tweet_id,tweet_text
0,853718217,True,golden,86,,The tweet uses offensive language but not hate...,0.6013,,True,,0,615561535,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,1666196150,Warning: penny boards will make you a faggot
1,853718218,True,golden,92,,The tweet contains hate speech,0.7227,,True,,0,615561723,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,429512078,Fuck dykes
2,853718219,True,golden,86,,The tweet contains hate speech,0.5229,,True,,0,615562039,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,395623778,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandon...
3,853718220,True,golden,98,,The tweet contains hate speech,0.5184,,True,,0,615562068,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,497514685,"""@jayswaggkillah: ""@JacklynAnnn: @jayswaggkill..."
4,853718221,True,golden,88,,The tweet uses offensive language but not hate...,0.5185,,True,,0,615562488,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1,588923553,@Zhugstubble You heard me bitch but any way I'...


In [93]:
with open('twitter-hate-speech-classifier-DFE-a845520.csv', 'r') as twitter_data:
    csv_reader = csv.reader(twitter_data, delimiter=',')
    line_count = 0
    print(csv_reader)
    for row in csv_reader:
        print(row)

<_csv.reader object at 0x1101a03c8>
['_unit_id', '_golden', '_unit_state', '_trusted_judgments', '_last_judgment_at', 'does_this_tweet_contain_hate_speech', 'does_this_tweet_contain_hate_speech:confidence', '_created_at', 'orig__golden', 'orig__last_judgment_at', 'orig__trusted_judgments', 'orig__unit_id', 'orig__unit_state', '_updated_at', 'orig_does_this_tweet_contain_hate_speech', 'does_this_tweet_contain_hate_speech_gold', 'does_this_tweet_contain_hate_speech_gold_reason', 'does_this_tweet_contain_hate_speechconfidence', 'tweet_id', 'tweet_text']
['853718218', 'TRUE', 'golden', '92', '', 'The tweet contains hate speech', '0.7227', '', 'TRUE', '', '0', '615561723', 'golden', '', 'The tweet contains hate speech', 'The tweet contains hate speech\nThe tweet uses offensive language but not hate speech', '', '1', '429512078', 'Fuck dykes']
['853718219', 'TRUE', 'golden', '86', '', 'The tweet contains hate speech', '0.5229', '', 'TRUE', '', '0', '615562039', 'golden', '', 'The tweet conta

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 5123: invalid start byte
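
The error above suggests the file is not valid UTF-8, so the rows are probably not the problem; the encoding is. Below is a sketch of two ways to get past that with base Python. The real encoding of the Twitter file is unknown to me, so this uses a small stand-in file containing the same offending byte (`0x89`), and the `read_rows` helper is my own.

```python
import csv
import os
import tempfile

# Stand-in file: the last row contains byte 0x89, which is not valid UTF-8.
raw = b"tweet_id,tweet_text\n1,plain ascii\n2,bad byte \x89 here\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)
    path = tmp.name


def read_rows(path, encoding, errors="strict"):
    """Read every row of a CSV file with the given decoding settings."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "r", encoding=encoding, errors=errors, newline="") as f:
        return list(csv.reader(f))


try:
    rows = read_rows(path, "utf-8")
except UnicodeDecodeError:
    # Option 1: keep UTF-8 but substitute U+FFFD for bytes that cannot be decoded.
    rows = read_rows(path, "utf-8", errors="replace")

# Option 2: a single-byte codec such as latin-1 never raises,
# but it may show the wrong characters if the guess is wrong.
rows_latin1 = read_rows(path, "latin-1")

os.remove(path)
print(rows[2])
print(rows_latin1[2])
```

`pd.read_csv` also accepts an `encoding` argument, so the same guess could be tried there instead of limiting the read with `nrows=69`.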

In [75]:
twitter_open = open('twitter-hate-speech-classifier-DFE-a845520.csv', 'r')

In [78]:
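# read() is a method of the file object; the original call, .read(twitter_open), raised a SyntaxError
# note: with the default utf-8 decoding this can still hit the UnicodeDecodeError seen above
test = twitter_open.read()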