# File Processing Across Science Domains with Python

Link to GitHub project: https://github.com/samwalkow/IS452_FinalProject

**Abstract:**

This project is half research and half coding: I use Python to read in and parse different file types from different science domains. For instance, some disciplines store everything in text files, others in JSON, and others in specialized file types. I'd like to pick 3-5 science domains, find some free data sources, research how to handle the files and any metadata, and try to develop a program that will read in and make sense of the file contents for each dataset. By making sense of the contents, I mean the data should be formatted either in a human-readable way or according to the domain standard.

The purpose of this project is to:

- Work on my file reading abilities in Python
- Investigate a systematic way of handling files in Python from different perspectives
- Understand how to handle metadata
- Develop skills for handling different types of data

**Table of Contents**

- Climate and Weather

- Earthquakes

- Social Science (Twitter data)

- Project Summary



# Climate + Weather

##### Possible Datasets:

- http://climate.weather.gc.ca/index_e.html
    - Historical data
    - Radar data
- http://climate.weather.gc.ca/prods_servs/cdn_climate_summary_e.html
    - Monthly summaries
    - csv and xml downloads
- https://open.canada.ca/data/en/dataset/8b624b7b-2e8f-436b-b9bd-f31c2e6613cf


##### Research
- Python resources
- https://drclimate.wordpress.com/2016/10/04/the-weatherclimate-python-stack/
- https://scitools.org.uk/iris/docs/latest/userguide/iris_cubes.html
- xarray
- https://arm-doe.github.io/pyart/

##### Domain Summary

I investigated a couple of different ways to read in weather data. I chose a dataset from the Canadian weather website that monitors snow and ice levels. The data came as XML, so I read the file in as one long string and used Python's xml.etree library to locate the root, show me the tree structure, and navigate through the data. The XML file contained both the data and the metadata, so I had the context I needed to process the data. I didn't need to go back to the website or look elsewhere for more information.

Once I had explored the data, I decided to focus on a few variables to extract and look at more closely. I moved down the levels and pulled out the temperature and precipitation nodes, whose attributes are stored as dictionaries. I used for loops to cycle through those and pull out specific values such as the year, the type of rainfall, the units, and the severity of the weather. I used sets to find the unique values.
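
The extraction step described above can be sketched in a more compact form. This is a minimal sketch, not the notebook's own code: the `SAMPLE` document and the `collect` helper are hypothetical, and the element nesting is only assumed from the attribute names (`class`, `units`, `year`) that appear in the cell outputs further down. It uses `Element.iter()` to visit every node, so the depth of the nesting does not need to be hard-coded.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real file's nesting may differ.
SAMPLE = """
<climatedata>
  <month index="1">
    <day index="1">
      <temperature class="extremeMax" units="C" year="2005">12.3</temperature>
      <precipitation class="extremeRainfall" units="mm" year="2007">20.1</precipitation>
      <precipitation class="extremeSnowfall" units="cm" year="1982">15.0</precipitation>
    </day>
  </month>
</climatedata>
"""

root = ET.fromstring(SAMPLE)


def collect(root, keyword, attribute):
    """Return `attribute` from every element whose tag contains `keyword`."""
    return [
        element.attrib[attribute]
        for element in root.iter()  # walks the whole tree, at any depth
        if keyword in element.tag and attribute in element.attrib
    ]


years = collect(root, "precipitation", "year")
classes = collect(root, "precipitation", "class")

# Sets give the unique values, as in the cells below.
print(sorted(set(years)))
print(sorted(set(classes)))
```

Because `iter()` is depth-independent, the same helper would work for the temperature nodes by changing only the keyword.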

Now that the values are pulled out of the XML file and its attribute dictionaries, we can look at the different categories and decide how to aggregate the data in those categories. We can also access values in the original dictionaries to manipulate them for analysis and visualization.
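
One possible way to aggregate by category is to group the record years under each precipitation class. The sketch below assumes a hypothetical `SAMPLE` document whose attribute names mirror the outputs shown later. The grouping with `defaultdict` is an illustration, not something the notebook does.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample using the attribute names seen in the notebook outputs.
SAMPLE = """
<climatedata>
  <day>
    <precipitation class="extremeRainfall" units="mm" year="2007"/>
    <precipitation class="extremeSnowfall" units="cm" year="1982"/>
  </day>
  <day>
    <precipitation class="extremeRainfall" units="mm" year="1962"/>
    <precipitation class="extremeSnowfall" units="cm" year="1966"/>
  </day>
</climatedata>
"""

# Group the record years under each precipitation class.
years_by_class = defaultdict(list)
for element in ET.fromstring(SAMPLE).iter():
    if "precipitation" in element.tag:
        years_by_class[element.attrib["class"]].append(int(element.attrib["year"]))

# Summarize each category: number of records, earliest year, latest year.
summary = {
    category: (len(years), min(years), max(years))
    for category, years in years_by_class.items()
}
for category, stats in sorted(summary.items()):
    print(category, stats)
```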

In [756]:
# import xml.etree package
import xml.etree.ElementTree as ET

In [757]:
# open file
with open('eng-almanac-0101-1231.xml', 'rt') as myfile:
    normal_file=myfile.read()

In [758]:
# find the root
root = ET.fromstring(normal_file)

In [759]:
# get the root tag
root.tag

'climatedata'

In [760]:
# get the root attributes
root.attrib

{'{http://www.w3.org/TR/xmlschema-1/}schemaLocation': 'http://climate.weather.gc.ca/climate_data/bulkxml/bulkschema.xsd'}

In [761]:
# find temperature values

temperature = []
for child in root:
    #print(child.tag, child.attrib)
    for c in child:
        #print(c.tag, c.attrib)
        for t in c :
            #print(t.tag, t.attrib)
            if "temperature" in t.tag:
                #print(t.attrib)
                #print(t.attrib['class'])
                temps = t.attrib['class']
                temperature.append(temps)
            
print(temperature)

['extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'normalMin', 'normalMean', 'extremeMax', 'extremeMin', 'normalMax', 'norma

In [762]:
# find years and rain values

years = []
rain = []
for child in root:
    #print(child.tag, child.attrib)
    for c in child:
        #print(c.tag, c.attrib)
        for t in c :
            #print(t.tag)
            if "precipitation" in t.tag:
                #print(t.attrib)
                years.append(t.attrib["year"])
                rain.append(t.attrib["class"])

# unique values

print(set(years))
print(set(rain))
print()

# how much data is in there
print(len(years))
print()
print(len(rain))

{'1986', '1975', '1989', '1983', '1974', '2002', '1976', '1994', '1970', '1982', '1969', '1992', '1959', '1991', '1965', '1979', '2004', '1960', '2008', '1985', '1964', '1999', '1988', '1977', '1978', '1995', '1993', '1961', '2000', '1957', '1967', '1972', '1962', '2003', '1966', '2006', '1971', '1998', '1990', '1968', '2007', '1996', '1984', '1973', '2005', '1980', '1963', '1997', '1987', '2001', '1958', '1981'}
{'extremeRainfall', 'extremeSnowfall', 'extremePrecipitation', 'extremeSnowOnGround'}

1464

1464


In [763]:
# what units are there?

unit = []
for child in root:
    #print(child.tag, child.attrib)
    for c in child:
        #print(c.tag, c.attrib)
        for t in c :
            #print(t.tag, t.attrib)
            if "precipitation" in t.tag:
                #print(t.attrib)
                unit.append(t.attrib['units'])
                
                
print(set(unit))

{'mm', 'cm'}


In [764]:
# looking at the whole precipitation dictionary

for child in root:
    #print(child.tag, child.attrib)
    for c in child:
        #print(c.tag, c.attrib)
        for t in c :
            #print(t.tag, t.attrib)
            if "precipitation" in t.tag:
                #print(t.attrib)
                for i in t.attrib.values():
                    print(i)


extremeRainfall
mm
metric
2007
1958-2008
extremeSnowfall
cm
metric
1982
1958-2008
extremePrecipitation
mm
metric
2007
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1962
1958-2008
extremeSnowfall
cm
metric
1966
1958-2008
extremePrecipitation
mm
metric
1962
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1984
1958-2008
extremeSnowfall
cm
metric
1978
1958-2008
extremePrecipitation
mm
metric
1984
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
2001
1958-2008
extremeSnowfall
cm
metric
1966
1958-2008
extremePrecipitation
mm
metric
2001
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
2007
1958-2008
extremeSnowfall
cm
metric
1959
1958-2008
extremePrecipitation
mm
metric
2007
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1997
1958-2008
extremeSnowfall
cm
metric
1991
1958-2008
extremePrecipitation
mm
metric
1997
1958-2008
extreme

1958-2008
extremePrecipitation
mm
metric
1986
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1963
1958-2008
extremeSnowfall
cm
metric
1976
1958-2008
extremePrecipitation
mm
metric
1963
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1980
1958-2008
extremeSnowfall
cm
metric
1971
1958-2008
extremePrecipitation
mm
metric
1980
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1959
1958-2008
extremeSnowfall
cm
metric
1976
1958-2008
extremePrecipitation
mm
metric
1959
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1961
1958-2008
extremeSnowfall
cm
metric
1976
1958-2008
extremePrecipitation
mm
metric
1961
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1964
1960-2008
extremeSnowfall
cm
metric
1976
1960-2008
extremePrecipitation
mm
metric
1964
1960-2008
extremeSnowOnGround
cm
metric
1984
1984-2004
extremeRainfall
mm
metric
1994
195

mm
metric
1996
1958-2006
extremeSnowOnGround
cm
metric
1981
1981-2006
extremeRainfall
mm
metric
1982
1958-2006
extremeSnowfall
cm
metric
1958
1958-2006
extremePrecipitation
mm
metric
1982
1958-2006
extremeSnowOnGround
cm
metric
1981
1981-2006
extremeRainfall
mm
metric
1969
1958-2006
extremeSnowfall
cm
metric
1958
1958-2006
extremePrecipitation
mm
metric
1969
1958-2006
extremeSnowOnGround
cm
metric
1981
1981-2006
extremeRainfall
mm
metric
1970
1958-2006
extremeSnowfall
cm
metric
1958
1958-2006
extremePrecipitation
mm
metric
1970
1958-2006
extremeSnowOnGround
cm
metric
1981
1981-2006
extremeRainfall
mm
metric
1997
1958-2004
extremeSnowfall
cm
metric
1958
1958-2006
extremePrecipitation
mm
metric
1997
1958-2004
extremeSnowOnGround
cm
metric
1981
1981-2006
extremeRainfall
mm
metric
1972
1958-2003
extremeSnowfall
cm
metric
1958
1958-2006
extremePrecipitation
mm
metric
1972
1958-2003
extremeSnowOnGround
cm
metric
1981
1981-2006
extremeRainfall
mm
metric
1981
1958-2006
extremeSnowfall
cm
metri

1981-2008
extremeRainfall
mm
metric
1983
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1983
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1990
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1990
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1999
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1999
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1981
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1981
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1990
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1990
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1973
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1973
1958-20

metric
1995
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1995
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1983
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1983
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
2000
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
2000
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
1993
1958-2008
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
1993
1958-2008
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
2008
1958-2008
†
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
2008
1958-2008
†
extremeSnowOnGround
cm
metric
1981
1981-2008
extremeRainfall
mm
metric
2008
1958-2008
†
extremeSnowfall
cm
metric
1958
1958-2008
extremePrecipitation
mm
metric
2008
1958-2008
†
extremeSnowOnGroun

mm
metric
1976
1958-2007
extremeSnowfall
cm
metric
1958
1958-2007
extremePrecipitation
mm
metric
1976
1958-2007
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1971
1958-2007
extremeSnowfall
cm
metric
1958
1958-2007
extremePrecipitation
mm
metric
1971
1958-2007
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1985
1958-2007
extremeSnowfall
cm
metric
1958
1958-2007
extremePrecipitation
mm
metric
1985
1958-2007
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1996
1958-2007
extremeSnowfall
cm
metric
1958
1958-2007
extremePrecipitation
mm
metric
1996
1958-2007
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1997
1958-2007
extremeSnowfall
cm
metric
1991
1958-2007
extremePrecipitation
mm
metric
1997
1958-2007
extremeSnowOnGround
cm
metric
1981
1981-2007
extremeRainfall
mm
metric
1975
1958-2007
extremeSnowfall
cm
metric
1958
1958-2007
extremePrecipitation
mm
metric
1975
1958-2007
extremeSnowOnGround
cm


extremeSnowfall
cm
metric
1990
1957-2007
extremePrecipitation
mm
metric
1962
1958-2007
extremeSnowOnGround
cm
metric
1996
1980-2007
extremeRainfall
mm
metric
1983
1957-2007
extremeSnowfall
cm
metric
1968
1957-2007
extremePrecipitation
mm
metric
1983
1957-2007
extremeSnowOnGround
cm
metric
1980
1980-2007
extremeRainfall
mm
metric
1966
1957-2006
extremeSnowfall
cm
metric
1964
1957-2006
extremePrecipitation
mm
metric
1966
1957-2006
extremeSnowOnGround
cm
metric
1980
1980-2007


# Earthquakes

##### Possible Datasets:

- https://earthquake.usgs.gov/earthquakes/search/
    - csv, xml or geojson
    - timeseries
    - https://stackoverflow.com/questions/42753745/how-can-i-parse-geojson-with-python

##### Research
- https://earthquake.usgs.gov/research/
- https://github.com/NCAR/chords/wiki/JSON-vs-GeoJSON
    - it doesn't seem like there are any technical differences between JSON and GeoJSON
    - GeoJSON seems to be a streamlined way of storing geographic data, that is, any data that is bound by coordinates in space
    - so I can break this down like JSON!
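
Since GeoJSON is ordinary JSON with agreed-upon keys, the standard library `json` module can read it as well. The sketch below is illustrative only: the `SAMPLE` string is a cut-down copy of the first feature printed later in this section, with most properties omitted.

```python
import json

# Cut-down copy of the first feature shown later in this section.
SAMPLE = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "id": "us700035gt",
      "geometry": {"type": "Point", "coordinates": [-175.7399, -22.6763, 10]},
      "properties": {"mag": 4.9, "place": "169km SSW of `Ohonua, Tonga"}
    }
  ]
}
"""

collection = json.loads(SAMPLE)  # a plain dict, no GeoJSON-specific parser needed
first = collection["features"][0]

print(collection["type"])
print(first["geometry"]["coordinates"])
print(first["properties"]["mag"], first["properties"]["place"])
```

The `geojson` package used below adds GeoJSON-aware object types on top of this, but the dictionary-style access is the same.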

##### Domain Summary

I found a seismology dataset on a US government website that is stored in GeoJSON format. I used a Python library called 'geojson' to read in and load the data into a variable. Once read in, the data is in a dictionary data structure, so I was able to extract the keys and values. I used for loops to unpack the data and find the values, some of which were also dictionaries. I grabbed the values I was interested in, put them into lists, and then aggregated them. I also processed some text, and used code and a function from the lecture notes to count values into a dictionary, which I then sorted by value.

Some metadata was missing from this dataset. For instance, I don't have units for some of the fields, so I don't know how big a 'gap' is or what magnitude is measured in. I did not find this information in the file or on the website. I was able to process the file and grab the data, but the lack of context would make it difficult to do meaningful analysis.

The data I processed is ready to be visualized or analyzed, at least for the fields I processed. This dataset challenged my knowledge of Python data structures. However, I enjoyed working with the GeoJSON format. It requires that I think in a different way than when working with a csv file. It requires more data processing, but at the same time the information can be stored in more ways, leading to a richer set of data. Also, I don't need to use Pandas.

"Done", in this case, means that I extracted the values I wanted and reformatted them to be ready for analysis or visualization. Not every field was processed, but enough data was processed that you could do some basic statistics or exploratory analysis.

In [453]:
# install package
! pip install geojson

You are using pip version 18.0, however version 19.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [454]:
# import package
import geojson

In [455]:
# open file with geojson package
with open("Earthquake_GeoJson.json") as f:
    gj = geojson.load(f)

In [456]:
# grab keys
print(gj.keys())

dict_keys(['type', 'metadata', 'bbox', 'features'])


In [457]:
# look at first entry
feature1 = gj['features'][0]
print(feature1)
# has the most data
# what is useful to extract and transform?
# coordinates, magnitude, magtype, place, sources, time

{"geometry": {"coordinates": [-175.7399, -22.6763, 10], "type": "Point"}, "id": "us700035gt", "properties": {"alert": null, "cdi": null, "code": "700035gt", "detail": "https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us700035gt&format=geojson", "dmin": 8.624, "felt": null, "gap": 116, "ids": ",us700035gt,", "mag": 4.9, "magType": "mb", "mmi": null, "net": "us", "nst": null, "place": "169km SSW of `Ohonua, Tonga", "rms": 0.74, "sig": 369, "sources": ",us,", "status": "reviewed", "time": 1555174874150, "title": "M 4.9 - 169km SSW of `Ohonua, Tonga", "tsunami": 0, "type": "earthquake", "types": ",geoserve,origin,phase-data,", "tz": -720, "updated": 1555177471040, "url": "https://earthquake.usgs.gov/earthquakes/eventpage/us700035gt"}, "type": "Feature"}


In [458]:
# look at the rest of the keys
print(gj["type"])
print()
# metadata could be useful
print(gj['metadata'])
print()
# could be useful for data visualization
print(gj['bbox'])

FeatureCollection

{'generated': 1555179250000, 'url': 'https://earthquake.usgs.gov/fdsnws/event/1/query.geojson?starttime=2019-03-14%2000:00:00&endtime=2019-04-13%2023:59:59&minmagnitude=4.5&orderby=time', 'title': 'USGS Earthquakes', 'status': 200, 'api': '1.8.1', 'count': 468}

[-179.8822, -64.6529, 2.8, 179.3101, 85.1725, 616.53]
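
The `time`, `updated`, and `generated` values are large integers. A minimal sketch for making them readable follows, assuming they are milliseconds since the Unix epoch. That assumption is consistent with the query window in the metadata URL (2019-03-14 to 2019-04-13), but it is not stated in the file itself. The helper name `epoch_ms_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms):
    """Convert integer milliseconds since the Unix epoch to an aware UTC datetime."""
    seconds, millis = divmod(ms, 1000)  # integer arithmetic avoids float rounding
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# 'time' value of the first feature printed above
event_time = epoch_ms_to_utc(1555174874150)
print(event_time.isoformat())
```

Under this assumption the first event falls on 2019-04-13, the last day of the query window.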


In [459]:
# next level of keys
feature1.keys()

dict_keys(['type', 'id', 'geometry', 'properties'])

In [460]:
# store 'features' keys in a list
features = gj["features"]

In [461]:
# iterate through list for ids
for i in features:
    print(i["id"])

us700035gt
us700035gm
us700035gd
us700035g9
us700035fl
us700035du
us700035dg
us700035cs
us700035c6
us700035bx
us700035bs
us700035be
us700035bf
us700035bb
us700035af
us700035a1
us7000359x
us7000359g
us7000359e
us7000358c
us70003582
us7000356s
us7000355t
us70003559
us7000350a
us70003505
us700034zz
us700034zs
us700034z9
us700034yw
us700034ya
us700034y7
us700034y1
us700034xq
us700034x8
us700034wx
us700034w6
us700034u1
us700034tc
us700034rw
us700034r9
us700034r4
us700034qh
us700034qe
us700034pc
us700034pa
us700034p6
us700034nf
us700034kz
us700034k3
us700034ju
us700034ji
us700034it
us700034iq
us700034d3
us700034ba
us700034a6
us7000349v
us7000348u
us700033pp
us700033n5
us700033mf
us700033j0
us700033j3
us700033hu
us700033ew
us700033d7
us600032iz
us600032fg
us600032f5
us2000kcxv
us2000kcv2
us2000kcqe
us2000kcm3
us2000kcll
us2000kcii
us2000kchr
us2000kceu
us2000kce1
us2000kccp
ak0194ibqout
us2000kcb1
us2000kca5
us2000kc9y
us2000kc8j
us2000kc6e
us2000kc5f
us2000kc3p
us2000kc35
us2000kc3j
us2000kc

In [462]:
# geometry key contains coordinates and a type (type is all the same, will leave this out)
for i in features:
    print(i["geometry"])

{"coordinates": [-175.7399, -22.6763, 10], "type": "Point"}
{"coordinates": [130.0443, -6.4382, 138.82], "type": "Point"}
{"coordinates": [94.46, 7.326, 10], "type": "Point"}
{"coordinates": [94.5506, 7.3889, 10], "type": "Point"}
{"coordinates": [129.5164, 26.2282, 10], "type": "Point"}
{"coordinates": [101.5666, -5.0697, 33.34], "type": "Point"}
{"coordinates": [151.9572, -5.2006, 57.35], "type": "Point"}
{"coordinates": [145.9517, -5.3576, 79.03], "type": "Point"}
{"coordinates": [129.5703, 26.0711, 10], "type": "Point"}
{"coordinates": [148.6864, -6.522, 38.45], "type": "Point"}
{"coordinates": [129.5757, 26.1999, 10], "type": "Point"}
{"coordinates": [129.5245, 26.2901, 10], "type": "Point"}
{"coordinates": [129.5337, 26.1984, 10], "type": "Point"}
{"coordinates": [129.5472, 26.1982, 10], "type": "Point"}
{"coordinates": [122.6027, -1.8184, 10], "type": "Point"}
{"coordinates": [122.6464, -1.876, 10], "type": "Point"}
{"coordinates": [-77.7363, -1.3571, 179.09], "type": "Point"}
{

In [463]:
# geometry contains coordinates, which are useful for analysis and visualization
print(len(features))
print()
for i in features:
    print(i["geometry"]["coordinates"])

468

[-175.7399, -22.6763, 10]
[130.0443, -6.4382, 138.82]
[94.46, 7.326, 10]
[94.5506, 7.3889, 10]
[129.5164, 26.2282, 10]
[101.5666, -5.0697, 33.34]
[151.9572, -5.2006, 57.35]
[145.9517, -5.3576, 79.03]
[129.5703, 26.0711, 10]
[148.6864, -6.522, 38.45]
[129.5757, 26.1999, 10]
[129.5245, 26.2901, 10]
[129.5337, 26.1984, 10]
[129.5472, 26.1982, 10]
[122.6027, -1.8184, 10]
[122.6464, -1.876, 10]
[-77.7363, -1.3571, 179.09]
[129.2527, -6.1746, 211.89]
[159.0724, 51.2762, 35]
[94.3985, 7.2825, 10]
[122.6147, -1.9204, 9.69]
[167.442, -17.4579, 10]
[122.6147, -1.8535, 20.72]
[-45.9198, 21.3649, 10]
[122.6286, -1.8534, 10]
[129.1867, -6.0211, 246.24]
[148.6682, -6.3993, 10]
[51.8204, 14.1765, 10]
[-126.8693, 40.4108, 5]
[-93.1121, 14.2607, 35]
[-172.7157, -15.4001, 10]
[122.5845, -1.828, 10]
[122.5065, -2.0217, 10]
[122.5527, -1.8518, 17.48]
[150.6005, -6.572, 10]
[131.5683, -3.862, 10]
[87.3838, 30.1695, 10]
[141.2731, 25.8246, 542.77]
[129.4949, 26.2184, 2.8]
[121.4608, 19.9645, 10]
[124.9
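
Each coordinate list has three numbers. The GeoJSON specification orders a position as longitude, latitude, and an optional third value. Reading that third value as a depth is an assumption about this particular feed, and the `Hypocenter` class below is a hypothetical name used only for illustration. Naming the fields makes later analysis less error-prone than indexing by position.

```python
from typing import NamedTuple


class Hypocenter(NamedTuple):
    """Named view of one coordinate triple (third field assumed to be depth)."""
    longitude: float
    latitude: float
    depth: float


# First two coordinate triples from the output above
coordinates = [
    [-175.7399, -22.6763, 10],
    [130.0443, -6.4382, 138.82],
]

points = [Hypocenter(*triple) for triple in coordinates]
deepest = max(points, key=lambda point: point.depth)

print(points[0].latitude, points[0].longitude)
print(deepest)
```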

In [464]:
# properties seems to have the most useful data, will iterate through to grab values
for i in features:
    print(i["properties"])

{'mag': 4.9, 'place': '169km SSW of `Ohonua, Tonga', 'time': 1555174874150, 'updated': 1555177471040, 'tz': -720, 'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us700035gt', 'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us700035gt&format=geojson', 'felt': None, 'cdi': None, 'mmi': None, 'alert': None, 'status': 'reviewed', 'tsunami': 0, 'sig': 369, 'net': 'us', 'code': '700035gt', 'ids': ',us700035gt,', 'sources': ',us,', 'types': ',geoserve,origin,phase-data,', 'nst': None, 'dmin': 8.624, 'rms': 0.74, 'gap': 116, 'magType': 'mb', 'type': 'earthquake', 'title': 'M 4.9 - 169km SSW of `Ohonua, Tonga'}
{'mag': 4.8, 'place': '218km NW of Saumlaki, Indonesia', 'time': 1555174401568, 'updated': 1555175281040, 'tz': 540, 'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us700035gm', 'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us700035gm&format=geojson', 'felt': None, 'cdi': None, 'mmi': None, 'alert': None, 'status': 'reviewed',

{'mag': 4.7, 'place': '106km SW of Puerto El Triunfo, El Salvador', 'time': 1553627250910, 'updated': 1553630083040, 'tz': -360, 'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us1000jlxq', 'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us1000jlxq&format=geojson', 'felt': None, 'cdi': None, 'mmi': None, 'alert': None, 'status': 'reviewed', 'tsunami': 0, 'sig': 340, 'net': 'us', 'code': '1000jlxq', 'ids': ',us1000jlxq,', 'sources': ',us,', 'types': ',geoserve,origin,phase-data,', 'nst': None, 'dmin': 1.869, 'rms': 1.08, 'gap': 170, 'magType': 'mb', 'type': 'earthquake', 'title': 'M 4.7 - 106km SW of Puerto El Triunfo, El Salvador'}
{'mag': 4.9, 'place': '213km W of Tual, Indonesia', 'time': 1553621851340, 'updated': 1553622967040, 'tz': 540, 'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/us1000jlum', 'detail': 'https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=us1000jlum&format=geojson', 'felt': None, 'cdi': None, 'mmi': None, 'alert': No

In [465]:
# properties I'm interested in looking at

for i in features:
    print(i["properties"]["mag"],i["properties"]["place"], i["properties"]["type"], i["properties"]["gap"])

4.9 169km SSW of `Ohonua, Tonga earthquake 116
4.8 218km NW of Saumlaki, Indonesia earthquake 69
4.9 144km ESE of Mohean, India earthquake 110
5.3 150km ESE of Mohean, India earthquake 97
5 158km ESE of Nago, Japan earthquake 69
4.7 160km SSW of Bengkulu, Indonesia earthquake 184
4.8 99km SSW of Kokopo, Papua New Guinea earthquake 90
5.2 22km SE of Madang, Papua New Guinea earthquake 25
4.6 169km ESE of Nago, Japan earthquake 153
5.2 90km E of Finschhafen, Papua New Guinea earthquake 55
4.6 165km ESE of Nago, Japan earthquake 72
4.9 157km ESE of Nago, Japan earthquake 60
4.9 161km ESE of Nago, Japan earthquake 57
4.9 162km ESE of Nago, Japan earthquake 67
4.8 98km SSW of Luwuk, Indonesia earthquake 36
4.5 103km S of Luwuk, Indonesia earthquake 68
4.8 30km ENE of Puyo, Ecuador earthquake 130
4.7 295km SSE of Saparua, Indonesia earthquake 78
4.6 179km E of Ozernovskiy, Russia earthquake 165
4.9 141km ESE of Mohean, India earthquake 59
4.8 108km S of Luwuk, Indonesia earthquake 37
4.6 98k

5 114km ESE of Mohean, India earthquake 64
4.9 117km ESE of Mohean, India earthquake 110
5.1 116km E of Mohean, India earthquake 26
4.8 149km ENE of Mohean, India earthquake 77
5 119km ESE of Mohean, India earthquake 80
4.9 104km ESE of Mohean, India earthquake 97
5.1 103km E of Mohean, India earthquake 27
4.9 111km E of Mohean, India earthquake 43
5 126km ESE of Mohean, India earthquake 94
4.5 130km ESE of Mohean, India earthquake 122
4.6 150km NE of Ndoi Island, Fiji earthquake 99
4.6 81km W of Onan Ganjang, Indonesia earthquake 127
4.5 136km SW of Vaini, Tonga earthquake 135
5.5 127km SW of Chimbote, Peru earthquake 153
4.9 22km N of La Libertad, Ecuador earthquake 107
4.6 112km W of Nuqui, Colombia earthquake 112
4.7 8km ESE of Pamandzi, Mayotte earthquake 49
5 10km NNE of Acipayam, Turkey earthquake 34
4.7 105km W of Kota Ternate, Indonesia earthquake 58
5.4 35km NNE of Santa Elena, Ecuador earthquake 36
5.1 34km NNE of Santa Elena, Ecuador earthquake 60
5 22km N of Salinas, Ecuad

4.7 South of the Fiji Islands earthquake 69
4.5 14km N of Calanasan, Philippines earthquake 80
4.8 178km SSE of Ishigaki, Japan earthquake 112
5 1km WSW of Ngondokandawu, Indonesia earthquake 82
4.5 South of the Fiji Islands earthquake 63
4.7 191km ESE of Iwaki, Japan earthquake 154
4.7 191km ESE of Iwaki, Japan earthquake 140
4.5 72km W of Coquimbo, Chile earthquake 168
4.5 133km SSE of L'Esperance Rock, New Zealand earthquake 184
4.8 12km ESE of Zakynthos, Greece earthquake 124
4.6 197km NNE of Ile Hunter, New Caledonia earthquake 109
4.5 45km NNW of Northern Islands Municipality - Mayor's Office, Northern Mariana Islands earthquake 82
4.6 Azores Islands region earthquake 90
5.5 4km SW of Sembalunbumbung, Indonesia earthquake 41
4.6 Northern Mid-Atlantic Ridge earthquake 125
4.7 Azores Islands region earthquake 63
4.6 Southern East Pacific Rise earthquake 172
4.7 Azores Islands region earthquake 81
4.5 144km NE of Mombetsu, Japan earthquake 130
4.5 Azores Islands region earthquake 16

In [466]:
# pull out countries from text data - not every row had a country

regions = []
for i in features:
    places = i["properties"]["place"]
    country = places.split(",")
    if len(country)>= 2:
        #print(len(country))
        #print(country)
        regions.append(country[1])
print(regions)

[' Tonga', ' Indonesia', ' India', ' India', ' Japan', ' Indonesia', ' Papua New Guinea', ' Papua New Guinea', ' Japan', ' Papua New Guinea', ' Japan', ' Japan', ' Japan', ' Japan', ' Indonesia', ' Indonesia', ' Ecuador', ' Indonesia', ' Russia', ' India', ' Indonesia', ' Vanuatu', ' Indonesia', ' Indonesia', ' Indonesia', ' Papua New Guinea', ' Yemen', ' California', ' Mexico', ' Tonga', ' Indonesia', ' Indonesia', ' Indonesia', ' Papua New Guinea', ' Indonesia', ' China', ' Japan', ' Japan', ' Philippines', ' Japan', ' Fiji', ' Panama', ' Indonesia', ' Papua New Guinea', ' Papua New Guinea', ' Japan', ' Japan', ' Chile', ' Papua New Guinea', ' Japan', ' Peru', ' Japan', ' India', ' India', ' India', ' India', ' Peru', ' Indonesia', ' Guam', ' New Caledonia', ' Saint Helena', ' Taiwan', ' Taiwan', ' South Sandwich Islands', ' Papua New Guinea', ' Taiwan', ' Colombia', ' Vanuatu', ' Tonga', ' El Salvador', ' Tonga', ' Alaska', ' Papua New Guinea', ' Indonesia', ' Alaska', ' Japan', ' I

In [467]:
# count countries up into a dictionary

region_count = {}

for i in regions:
    if i not in region_count:
        region_count[i] = 1
    else:
        region_count[i] += 1
        
region_count

{' Tonga': 17,
 ' Indonesia': 63,
 ' India': 48,
 ' Japan': 39,
 ' Papua New Guinea': 22,
 ' Ecuador': 10,
 ' Russia': 4,
 ' Vanuatu': 11,
 ' Yemen': 1,
 ' California': 1,
 ' Mexico': 6,
 ' China': 6,
 ' Philippines': 15,
 ' Fiji': 9,
 ' Panama': 2,
 ' Chile': 18,
 ' Peru': 8,
 ' Guam': 6,
 ' New Caledonia': 3,
 ' Saint Helena': 1,
 ' Taiwan': 6,
 ' South Sandwich Islands': 1,
 ' Colombia': 3,
 ' El Salvador': 8,
 ' Alaska': 15,
 ' Mauritius': 2,
 ' India region': 1,
 ' Argentina': 7,
 ' New Zealand': 18,
 ' Nicaragua': 3,
 ' East Timor': 3,
 ' Solomon Islands': 10,
 ' Portugal': 1,
 ' Tajikistan': 1,
 ' South Georgia and the South Sandwich Islands': 3,
 ' Mayotte': 5,
 ' Japan region': 3,
 ' Pakistan': 4,
 ' Turkey': 4,
 ' Uganda': 1,
 ' Federated States of Micronesia': 1,
 ' Bolivia': 4,
 ' Iraq': 2,
 ' Iran': 2,
 ' Greece': 3,
 ' Somalia': 2,
 ' Burma': 2,
 ' Kenya': 1,
 ' Northern Mariana Islands': 5,
 ' Afghanistan': 2,
 ' South Africa': 1,
 ' Turkmenistan': 1,
 ' Tanzania': 1,
 '

In [468]:
# sort dictionary by frequency (value)

def byFreq(pair):
    return pair[1]

pair_list = list(region_count.items())

pair_list.sort(key = byFreq, reverse = True)

for pair in pair_list:
    print(pair[0], pair[1]) 

 Indonesia 63
 India 48
 Japan 39
 Papua New Guinea 22
 Chile 18
 New Zealand 18
 Tonga 17
 Philippines 15
 Alaska 15
 Vanuatu 11
 Ecuador 10
 Solomon Islands 10
 Fiji 9
 Peru 8
 El Salvador 8
 Argentina 7
 Mexico 6
 China 6
 Guam 6
 Taiwan 6
 Mayotte 5
 Northern Mariana Islands 5
 Russia 4
 Pakistan 4
 Turkey 4
 Bolivia 4
 New Caledonia 3
 Colombia 3
 Nicaragua 3
 East Timor 3
 South Georgia and the South Sandwich Islands 3
 Japan region 3
 Greece 3
 Panama 2
 Mauritius 2
 Iraq 2
 Iran 2
 Somalia 2
 Burma 2
 Afghanistan 2
 Bouvet Island 2
 Yemen 1
 California 1
 Saint Helena 1
 South Sandwich Islands 1
 India region 1
 Portugal 1
 Tajikistan 1
 Uganda 1
 Federated States of Micronesia 1
 Kenya 1
 South Africa 1
 Turkmenistan 1
 Tanzania 1
 Guatemala 1
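
The counting and sorting steps above can also be done with `collections.Counter`. This is an alternative sketch rather than the lecture-notes approach used in the cells. It also strips the leading space that `split(",")` leaves on each region name. The `places` list is a small hand-picked subset of the `place` strings printed earlier.

```python
from collections import Counter

# Small subset of the 'place' strings shown earlier
places = [
    "169km SSW of `Ohonua, Tonga",
    "218km NW of Saumlaki, Indonesia",
    "144km ESE of Mohean, India",
    "160km SSW of Bengkulu, Indonesia",
    "South of the Fiji Islands",  # no comma, so it is skipped as in the original cell
]

regions = []
for place in places:
    parts = place.split(",")
    if len(parts) >= 2:
        regions.append(parts[1].strip())  # strip() removes the leading space

region_count = Counter(regions)

# most_common() returns (region, count) pairs sorted by count, highest first
for region, count in region_count.most_common():
    print(region, count)
```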


In [469]:
# Average magnitude
m = []
for i in features:
    m.append(i["properties"]["mag"])
    
print(sum(m)/len(m))

4.872008547008539


In [470]:
# average gap size
g = []
for i in features:
    gap = i["properties"]["gap"]
    if isinstance(gap, int):
        g.append(gap)
    
print(sum(g)/len(g))

93.58709677419355
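
The `isinstance(gap, int)` check above skips missing values, but it would also skip any gap stored as a float. Whether the file contains float gaps is not shown here, so the sketch below only illustrates a filter that keeps both numeric types. The `gaps` list is hypothetical sample data, not values taken from the file.

```python
from statistics import mean

# Hypothetical sample: integers, a float, and a missing value
gaps = [116, 69, None, 97.5, 184]

# Keep ints and floats; drop None (and booleans, which are a subclass of int)
numeric_gaps = [
    gap for gap in gaps
    if isinstance(gap, (int, float)) and not isinstance(gap, bool)
]

average_gap = mean(numeric_gaps)
print(len(numeric_gaps), average_gap)
```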


In [471]:
# fraction of earthquakes with the tsunami flag set (multiply by 100 for a percentage)
p = []
for i in features:
    p.append(i["properties"]["tsunami"])
    
print(sum(p)/len(p))

0.0641025641025641


# Social Science

##### Possible Datasets:
- https://cooldatasets.com/
    - twitter comment data
    - csv format

##### Domain Summary

I used pandas to read in and process some Twitter data, which contains Twitter comments that may or may not be considered hateful. The file format was csv, so it was fairly straightforward to process. I had to exclude rows that did not read in correctly, possibly due to a parsing issue. I was still able to access a large chunk of the data without the problematic rows. I also left out the last column, which contained the actual tweets.
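
One way to exclude rows that do not parse cleanly, without knowing their positions in advance, is to compare each row's field count with the header. The sketch below uses the standard library `csv` module on a hypothetical three-row sample (`RAW`). It is not the approach taken in the cells below, which read a fixed number of rows with pandas. Recent pandas versions also offer an `on_bad_lines="skip"` option on `read_csv` for a similar purpose.

```python
import csv
import io

# Hypothetical sample: the second data row has one field too many
RAW = "id,label,confidence\n1,hate,0.60\n2,offensive,0.72,extra\n3,neither,0.52\n"


def read_well_formed(handle):
    """Return (good_rows, bad_line_numbers) for a CSV file-like object."""
    reader = csv.reader(handle)
    header = next(reader)
    good_rows, bad_lines = [], []
    for row in reader:
        if len(row) == len(header):
            good_rows.append(dict(zip(header, row)))
        else:
            bad_lines.append(reader.line_num)  # line in the source where the row ended
    return good_rows, bad_lines


good_rows, bad_lines = read_well_formed(io.StringIO(RAW))
print(len(good_rows), bad_lines)
```

Keeping the list of rejected line numbers makes it possible to go back and inspect what went wrong instead of silently dropping data.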

Once I had everything in a dataframe, I was able to look at the columns, the shape, the data types, and the missing values in the dataset. I was then able to group some of the integer columns by the categorical variables to create some aggregate data. In this case, the data in the csv file came in a form that was fairly easy to read in and start working with. There was a lot of missing data and several confusing categories, with no metadata (such as a data dictionary) to help me sort it out. Through this process, I've learned how important it is to have context around the data and the data collection process.

Pandas does make it simple to take tabular data and make it Python ready. That being said, I don't find the error messages helpful at all. Pandas feels intuitive at first, but now that I've also learned R, I find it really annoying and counter to everything I've learned in Python. If it wasn't so easy to use, I would look for something else.

Analysis-ready data for a social science dataset seems to depend on what data type you need the data to be in, and whether or not you have that piece of data isolated into its own variable. In this case, I was able to load the data and start using it with no transformation necessary. The data is ready to be visualized or analyzed.

In [472]:
import pandas as pd

In [473]:
# try the twitter dataset - if it can't be fixed, find a new one
# find the issue in the rows - find a way to exclude difficult rows
# row 70 has an issue, so only the first 69 rows are read in
twitter = pd.read_csv('twitter-hate-speech-classifier-DFE-a845520.csv', sep=',', nrows=69, header = 0)
#twitter = pd.read_csv('twitter-hate-speech-classifier-DFE-a845520.csv', sep=',', skiprows= [69], header = 0)
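
Rather than hard-coding the row that fails, pandas can be asked to skip rows it cannot parse. The following is a minimal sketch, assuming pandas 1.3 or newer (where `read_csv` accepts `on_bad_lines`). The inline sample and its column names are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real csv file:
# the third data row has an extra field, so it does not fit into 3 columns.
sample_csv = io.StringIO(
    "unit_id,label,confidence\n"
    "1,hate,0.60\n"
    "2,offensive,0.72\n"
    "3,none,0.90,unexpected_extra_field\n"
    "4,none,0.95\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising an error
sample = pd.read_csv(sample_csv, on_bad_lines="skip")
print(sample.shape)
```

Skipping rows silently hides how many were lost, so it is worth comparing the resulting row count with the number of lines in the raw file.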

In [474]:
# I omitted the last column, which has the actual tweets, most of which were extremely offensive
twitter_pd = twitter.iloc[:, :19]
twitter_pd

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,does_this_tweet_contain_hate_speech,does_this_tweet_contain_hate_speech:confidence,_created_at,orig__golden,orig__last_judgment_at,orig__trusted_judgments,orig__unit_id,orig__unit_state,_updated_at,orig_does_this_tweet_contain_hate_speech,does_this_tweet_contain_hate_speech_gold,does_this_tweet_contain_hate_speech_gold_reason,does_this_tweet_contain_hate_speechconfidence,tweet_id
0,853718217,True,golden,86,,The tweet uses offensive language but not hate...,0.6013,,True,,0.0,615561535.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,1.666196e+09
1,853718218,True,golden,92,,The tweet contains hate speech,0.7227,,True,,0.0,615561723.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,4.295121e+08
2,853718219,True,golden,86,,The tweet contains hate speech,0.5229,,True,,0.0,615562039.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,3.956238e+08
3,853718220,True,golden,98,,The tweet contains hate speech,0.5184,,True,,0.0,615562068.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,4.975147e+08
4,853718221,True,golden,88,,The tweet uses offensive language but not hate...,0.5185,,True,,0.0,615562488.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,5.889236e+08
5,853718222,True,golden,93,,The tweet contains hate speech,0.8816,,True,,0.0,615562522.0,golden,,The tweet contains hate speech,The tweet contains hate speech,,1.0,2.038160e+08
6,853718223,True,golden,88,,The tweet contains hate speech,0.5207,,True,,0.0,615562768.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,1.945525e+09
7,853718224,True,golden,90,,The tweet contains hate speech,0.5619,,True,,0.0,615563304.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,2.108181e+08
8,853718225,True,golden,92,,The tweet uses offensive language but not hate...,0.6419,,True,,0.0,615563419.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,3.428269e+08
9,853718226,True,golden,95,,The tweet uses offensive language but not hate...,0.6407,,True,,0.0,615563491.0,golden,,The tweet contains hate speech,The tweet contains hate speech\nThe tweet uses...,,1.0,8.184304e+08


In [475]:
# how many columns and rows
twitter_pd.shape

(69, 19)

In [476]:
# variable names
twitter_pd.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'does_this_tweet_contain_hate_speech',
       'does_this_tweet_contain_hate_speech:confidence', '_created_at',
       'orig__golden', 'orig__last_judgment_at', 'orig__trusted_judgments',
       'orig__unit_id', 'orig__unit_state', '_updated_at',
       'orig_does_this_tweet_contain_hate_speech',
       'does_this_tweet_contain_hate_speech_gold',
       'does_this_tweet_contain_hate_speech_gold_reason',
       'does_this_tweet_contain_hate_speechconfidence', 'tweet_id'],
      dtype='object')

In [477]:
# all of the missing values by variable name
twitter_pd.isna().sum()

_unit_id                                            0
_golden                                             0
_unit_state                                         0
_trusted_judgments                                  0
_last_judgment_at                                  67
does_this_tweet_contain_hate_speech                 0
does_this_tweet_contain_hate_speech:confidence      0
_created_at                                        69
orig__golden                                        2
orig__last_judgment_at                             69
orig__trusted_judgments                             2
orig__unit_id                                       2
orig__unit_state                                    2
_updated_at                                        69
orig_does_this_tweet_contain_hate_speech            2
does_this_tweet_contain_hate_speech_gold            2
does_this_tweet_contain_hate_speech_gold_reason    69
does_this_tweet_contain_hate_speechconfidence       2
tweet_id                    
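
Several of the columns above are missing in every row that was read in, so they carry no information. A minimal sketch of finding and removing such columns, using a small made-up dataframe (the column names are borrowed from the output above, the values are invented):

```python
import numpy as np
import pandas as pd

# Invented stand-in for twitter_pd: one column is entirely missing
df = pd.DataFrame(
    {
        "_unit_id": [1, 2, 3],
        "_created_at": [np.nan, np.nan, np.nan],
        "orig__unit_id": [10.0, np.nan, 30.0],
    }
)

# fraction of missing values per column (1.0 means the column is empty)
missing_share = df.isna().mean()

# drop only the columns where every value is missing
df_clean = df.dropna(axis=1, how="all")
print(list(df_clean.columns))
```

Columns that are only partly missing, like `orig__unit_id` here, are kept, since deciding what to do with them needs the domain context that this dataset did not provide.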

In [478]:
# all of the data types
twitter_pd.dtypes

_unit_id                                             int64
_golden                                               bool
_unit_state                                         object
_trusted_judgments                                   int64
_last_judgment_at                                   object
does_this_tweet_contain_hate_speech                 object
does_this_tweet_contain_hate_speech:confidence     float64
_created_at                                        float64
orig__golden                                        object
orig__last_judgment_at                             float64
orig__trusted_judgments                            float64
orig__unit_id                                      float64
orig__unit_state                                    object
_updated_at                                        float64
orig_does_this_tweet_contain_hate_speech            object
does_this_tweet_contain_hate_speech_gold            object
does_this_tweet_contain_hate_speech_gold_reason    float

In [479]:
# confidence in tweet judgement grouped by hate speech category
confidence = twitter_pd.groupby("does_this_tweet_contain_hate_speech")["does_this_tweet_contain_hate_speech:confidence"].mean()
confidence



does_this_tweet_contain_hate_speech
The tweet contains hate speech                           0.702747
The tweet is not offensive                               0.941700
The tweet uses offensive language but not hate speech    0.827110
Name: does_this_tweet_contain_hate_speech:confidence, dtype: float64

In [480]:
# look at the _trusted_judgments column (position 3) on its own
o = twitter_pd.iloc[:,3:4]
o.head()

Unnamed: 0,_trusted_judgments
0,86
1,92
2,86
3,98
4,88


In [481]:
# mean number of trusted judgments per tweet, grouped by hate speech category
judges = twitter_pd.groupby("does_this_tweet_contain_hate_speech")["_trusted_judgments"].mean()
judges

does_this_tweet_contain_hate_speech
The tweet contains hate speech                           90.882353
The tweet is not offensive                               84.304348
The tweet uses offensive language but not hate speech    90.620690
Name: _trusted_judgments, dtype: float64
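
The two groupby calls above could also be combined into a single summary table. A minimal sketch using named aggregation (available in pandas 0.25 and later) on an invented dataframe; the column names match the dataset, but the labels and values are made up:

```python
import pandas as pd

label_col = "does_this_tweet_contain_hate_speech"
conf_col = "does_this_tweet_contain_hate_speech:confidence"

# Invented values for illustration only
df = pd.DataFrame(
    {
        label_col: ["hate", "hate", "not offensive"],
        conf_col: [0.5, 0.7, 0.9],
        "_trusted_judgments": [90, 92, 84],
    }
)

# one row per category, one named column per aggregate
summary = df.groupby(label_col).agg(
    n_tweets=(conf_col, "count"),  # non-missing confidence values per category
    mean_confidence=(conf_col, "mean"),
    mean_judgments=("_trusted_judgments", "mean"),
)
print(summary)
```

Including the count next to each mean helps show how many tweets each average is based on, which matters with only 69 rows.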

# Project Summary





I used this project to explore how to read in and parse data from different science domains. When looking for datasets, I found that most domains had data in csv files, and I was surprised by how widely available csv data was. JSON was also easy to find. JSON was the easiest format to work with, but the XML file format had the most information contained within the file itself. The csv data, while plentiful, often had many errors or wasn't all usable. It was difficult to troubleshoot csv file errors, and I often had to find a new dataset.
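
To make the comparison concrete, here is a minimal sketch that reads the same invented record from csv, JSON and XML using only the standard library. The field names and values are hypothetical and do not come from any of the project's datasets.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same made-up observation in three formats
csv_text = "station,snow_cm\nA1,12.5\n"
json_text = '{"station": "A1", "snow_cm": 12.5}'
xml_text = '<obs units="cm"><station>A1</station><snow>12.5</snow></obs>'

# csv: every value arrives as a string, with no units or types attached
csv_row = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: numbers keep their type, and the structure maps onto dicts and lists
json_row = json.loads(json_text)

# XML: attributes can carry metadata (here, the units) next to the data
root = ET.fromstring(xml_text)
xml_row = {
    "station": root.findtext("station"),
    "snow_cm": float(root.findtext("snow")),
    "units": root.get("units"),
}

print(csv_row["snow_cm"], json_row["snow_cm"], xml_row)
```

After parsing, all three end up as plain dictionaries, but only the XML version brought its units along with it.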

Once I had the data read in and had figured out how to navigate through the file format, it was just a matter of dealing with lists and dictionaries, and it felt similar to things we had done in class. I used several outside packages to help me read in the data, but after that I used base python for most of my work. I didn't do any analysis or visualization, so the end product for each domain was easily accessible or reformatted data that could be readily used for analysis or visualization, or that is simply more human readable.

This project reinforced for me that strong python skills are needed to handle data effectively from any file and any domain. While the data can live in many formats, once you extract the values there are only so many ways to manipulate and move them around. While doing that, I also have to keep the domain context in mind (using things like the metadata) to remind me of the meaning behind what I'm working with.