# Web-scraping User Profile Information

This project uses data from a community-driven website for traditional Irish music

https://thesession.org/

on which users can submit tunes in '.abc' format.

The website's designer kindly provides portions of the site database in JSON format

https://github.com/adactio/TheSession-data

which includes:

- Tune settings (submitted by users)
- Recordings of tunes (which are cross-referenced against tune settings)
- Aliases of tunes (traditional tunes are referred to by different names)
- Sessions and Events (posted by users)

This forms the initial basis of the SQL-based database defined in thesessionDB.py.

However, the site has additional information that I am interested in using.

Namely, users have the option of providing a location

https://thesession.org/members/97793

which we can then link to their saved tunes

https://thesession.org/members/97793/tunebook

and to the tunes for which they have submitted settings

https://thesession.org/members/97793/tunes

The scraping code is located in scraper.py.

I plan to aggregate this information and link it with the existing thesessionDB.
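
As a rough illustration of what this linking could look like, here is a minimal sketch using an in-memory SQLite database. The table names (`user_location`, `user_tune`), their columns, and the rows are all hypothetical placeholders; they are not taken from thesessionDB.py, whose actual schema may differ.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only
# and are NOT taken from thesessionDB.py.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_location (user_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE user_tune (user_id INTEGER, tune_id INTEGER);
""")

# Made-up rows standing in for scraped profiles and tunebooks.
conn.executemany("INSERT INTO user_location VALUES (?, ?, ?)",
                 [(1, "Berkeley", "California"), (2, "Ennis", "Clare")])
conn.executemany("INSERT INTO user_tune VALUES (?, ?)",
                 [(1, 27), (1, 55), (2, 27)])


def tunes_by_state(conn):
    """Count, for each state, how many users have each tune in their tunebook."""
    query = """
        SELECT l.state, t.tune_id, COUNT(*) AS n_users
        FROM user_tune AS t
        JOIN user_location AS l ON l.user_id = t.user_id
        GROUP BY l.state, t.tune_id
        ORDER BY l.state, t.tune_id
    """
    return conn.execute(query).fetchall()


rows = tunes_by_state(conn)
for state, tune_id, n_users in rows:
    print(state, tune_id, n_users)
```

Keeping locations and tunebook entries in separate tables keyed by user id means either one can be re-scraped without touching the other.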

In [1]:
from __future__ import absolute_import, division, print_function

# scraper.py
from scraper import userProfile, userTunes


print('Testing userProfile:')
up = userProfile(97793)

print('\nUser:',up.userName,up.user_id)

print('\nDescription:\n',up.description)

print('\nLocation:\n',up.location)
print(up.locationInfo['city'])
print(up.locationInfo['county'])
print(up.locationInfo['state'])

print('\nTesting userTunes:')
utb = userTunes(97793)

print('\nUser\'s settings:')
print(utb.settings)

print('\nUser\'s tunes:')
print(utb.tunes)


Testing userProfile:
https://thesession.org/members/97793

User: sheaR 97793

Description:
 
I am an anglo concertina and whistle player, living in the East Bay. I play mostly Irish tunes, but also an increasing number of Scottish ones along the occasional Breton/Swedish/French Canadian etc.


Location:
 [37.86214828, -122.26237488]
Berkeley
Alameda County
California

Testing userTunes:
https://thesession.org/members/97793/tunebook?page=1
https://thesession.org/members/97793/tunebook?page=2
https://thesession.org/members/97793/tunebook?page=3
https://thesession.org/members/97793/tunes?page=1

User's settings:
[31140, 31139, 29285, 28952, 28336, 25784]

User's tunes:
[    1     2     5     8     9    10    12    15    16    17    19    20
    21    23    24    26    27    28    29    30    33    34    35    36
    39    42    43    44    49    52    53    54    55    56    62    63
    64    67    69    70    71    73    74    75    76    83    84    86
    88    92    93    

# Proposed goals for location information
Here are some possible uses for this kind of information:

- Distribution of particular tunes

- Variation in popularity of tune types

- Quantifying how 'traditional' a region's repertoire is (e.g. by comparing it to O'Neill's or a similar collection).

- Looking for similarities / influences between locations' repertoires.

- Generating a list of tunes to try if you are attending a session in a new location (cross-referenced with a user's tunebook).

- Regional connectedness could be included as a feature for an ML-based 
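
One simple way to compare the repertoires of two locations is the Jaccard index (shared tunes divided by all tunes in either repertoire). The sketch below assumes each location's repertoire has already been aggregated into a collection of tune ids; the location names and ids are made up for illustration, and Jaccard is only one of several possible similarity measures.

```python
def jaccard(a, b):
    """Jaccard index of two collections of tune ids (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_similarity(repertoires):
    """Return {(loc_a, loc_b): similarity} for every unordered pair of locations."""
    names = sorted(repertoires)
    return {(x, y): jaccard(repertoires[x], repertoires[y])
            for i, x in enumerate(names)
            for y in names[i + 1:]}


# Made-up repertoires: location -> tune ids found in local users' tunebooks.
repertoires = {
    "Clare": [1, 2, 5, 8],
    "Boston": [2, 5, 9],
    "Donegal": [10, 12],
}

sims = pairwise_similarity(repertoires)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 2))
```

Because the index ignores how many users know each tune, a weighted measure may be more appropriate once per-location counts are available.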

# Tune comment information

The site has another source of information about tunes that is not included in the database provided by Jeremy: the user comments on tune pages, which contain a wealth of more specific information. Natural language processing could be applied to a variety of interesting tasks, including:

- Origin of tunes (this could be more specific than the previous suggestion based on user locations)
- Connection to musicians (particularly those pre-dating modern recordings)
- Identifying traditional versus modern tunes
- Instrument-specific repertoires (e.g. piping tunes)
- More abstract traits of the tune itself? (e.g. tempo, feel)
- Related tunes via references to other tune pages (need to be cautious, as people also suggest tunes in this way)
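
As a starting point for the last item, references to other tune pages could be pulled out of comment text with a regular expression. This sketch assumes that such references appear as plain `thesession.org/tunes/<id>` URLs in the comment content, which I have not verified; the sample text is made up. A match only means that a tune was mentioned, not that it is musically related.

```python
import re

# Assumption: references show up as thesession.org/tunes/<id> URLs in the text.
TUNE_LINK = re.compile(r"thesession\.org/tunes/(\d+)")


def referenced_tunes(comment_text, own_id=None):
    """Return the distinct tune ids mentioned in a comment, in order of appearance.

    own_id is the id of the tune page the comment belongs to, so that
    self-references (e.g. links to other comments on the same page) are skipped.
    """
    ids = []
    for match in TUNE_LINK.finditer(comment_text):
        tid = int(match.group(1))
        if tid != own_id and tid not in ids:
            ids.append(tid)
    return ids


# Made-up comment text for illustration.
sample = ("This goes well after https://thesession.org/tunes/55 and "
          "see also https://thesession.org/tunes/27#comment26")
refs = referenced_tunes(sample, own_id=27)
print(refs)
```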

In [2]:
# I just remembered that there is a JSON-based API. This is easier than what I did
# with the user information, although it may not include location.

# This might also provide an easier means of finding all of the settings for a given tune.

import json

try:
    # Python 3
    from urllib.request import urlopen
except ImportError:
    # Python 2
    from urllib2 import urlopen

tid = 27
url = 'https://thesession.org/tunes/' + str(tid) + '?format=json'
print("Downloading tune comment data from %s..." % url)
response = urlopen(url)
data = json.loads(response.read().decode('utf8'))



Downloading tune comment data from https://thesession.org/tunes/27?format=json...


In [3]:
data.keys()

[u'name',
 u'format',
 u'url',
 u'settings',
 u'comments',
 u'member',
 u'recordings',
 u'date',
 u'tunebooks',
 u'type',
 u'id',
 u'aliases']

In [4]:
# Compare this to the dumped information
with open('json/tunes.json', 'rb') as f:
    tuneDB = json.loads(f.read().decode('utf8'))
tuneDB[-1].keys()

# New information is 'comments', 'recordings', 'tunebooks';
# it does not include the "played with" information (would have to scrape that)

[u'username',
 u'abc',
 u'name',
 u'meter',
 u'setting',
 u'mode',
 u'date',
 u'type',
 u'tune']

In [5]:
# These are only totals, but they could be used as a generic popularity feature
print(data['recordings'])
print(data['tunebooks'])

161
5168


In [6]:
print(data['comments'])

[{u'member': {u'url': u'https://thesession.org/members/1', u'id': 1, u'name': u'Jeremy'}, u'url': u'https://thesession.org/tunes/27#comment26', u'content': u'Don&#039;t let the title fool you; there&#039;s nothing drowsy about this tune. It&#039;s usually played at a very fast pace.    The melody itself is quite straightforward with plenty of those pipe-like jumps back to the E note. These jumps can be a bit tricky at first, especially on stringed instruments like the fiddle and banjo. The best solution is to have both the E and B notes permanently fingered for the first few bars.    When this tune gets up to breakneck speed, it might be a good idea to play less notes and lengthen them.', u'date': u'2001-06-01 05:42:29', u'id': 26, u'subject': u''}, {u'member': {u'url': u'https://thesession.org/members/291', u'id': 291, u'name': u'Munsondr'}, u'url': u'https://thesession.org/tunes/27#comment2645', u'content': u'I agree with Jeremy that there&#039;s absolutely nothing &quot;sleepy&quot;

In [7]:
# sample terms to search for
# would also need a good way to handle alternative names (AKAs)
{'instruments':['fiddle','flute','pipe','concertina','accordion','whistle','harp','bouzouki','mandolin'],
'musicians':['Kevin Burke','Michael Coleman','Joe Cooley',"O'Carolan",'John Doherty'],
'groups':['Altan','Bothy Band','Chieftains'],
'locations':['Antrim','Armagh','Carlow','Cavan','Clare','Cork','Donegal',
             'Down','Fermanagh','Galway','Kerry','Kildare','Kilkenny','Laois',
             'Leitrim','Limerick','Londonderry','Derry','Longford','Louth',
             'Mayo','Meath','Monaghan','Offaly','Roscommon','Sligo','Tipperary',
             'Tyrone','Waterford','Wexford','Wicklow','Dublin','London','Chicago',
             'New York','San Francisco','Boston','Shetland','Cape Breton'],
'history':['traditional','trad','modern','contemporary','composed']}

{'groups': ['Altan', 'Bothy Band', 'Chieftains'],
 'history': ['traditional', 'trad', 'modern', 'contemporary', 'composed'],
 'instruments': ['fiddle',
  'flute',
  'pipe',
  'concertina',
  'accordion',
  'whistle',
  'harp',
  'bouzouki',
  'mandolin'],
 'locations': ['Antrim',
  'Armagh',
  'Carlow',
  'Cavan',
  'Clare',
  'Cork',
  'Donegal',
  'Down',
  'Fermanagh',
  'Galway',
  'Kerry',
  'Kildare',
  'Kilkenny',
  'Laois',
  'Leitrim',
  'Limerick',
  'Londonderry',
  'Derry',
  'Longford',
  'Louth',
  'Mayo',
  'Meath',
  'Monaghan',
  'Offaly',
  'Roscommon',
  'Sligo',
  'Tipperary',
  'Tyrone',
  'Waterford',
  'Wexford',
  'Wicklow',
  'Dublin',
  'London',
  'Chicago',
  'New York',
  'San Francisco',
  'Boston',
  'Shetland',
  'Cape Breton'],
 'musicians': ['Kevin Burke',
  'Michael Coleman',
  'Joe Cooley',
  "O'Carolan",
  'John Doherty']}
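
A first pass at using such a term dictionary could simply count whole-word matches in each comment. The sketch below is written for Python 3 and assumes the comment structure shown earlier (a list of dicts with a `content` key containing HTML entities such as `&#039;`); the two sample comments and the shortened term lists are made up for illustration.

```python
import html
import re
from collections import Counter


def count_terms(comments, terms_by_category):
    """Count case-insensitive whole-word matches of each term, per category."""
    counts = {category: Counter() for category in terms_by_category}
    for comment in comments:
        # Decode entities such as &#039; before matching.
        text = html.unescape(comment["content"]).lower()
        for category, terms in terms_by_category.items():
            for term in terms:
                pattern = r"\b" + re.escape(term.lower()) + r"\b"
                n = len(re.findall(pattern, text))
                if n:
                    counts[category][term] += n
    return counts


# Made-up comments and a shortened term dictionary.
comments = [
    {"content": "There&#039;s nothing drowsy about this tune on the fiddle."},
    {"content": "A great fiddle and banjo tune from Clare."},
]
terms = {"instruments": ["fiddle", "flute"], "locations": ["Clare", "Cork"]}

counts = count_terms(comments, terms)
print(counts)
```

Whole-word matching means that 'pipe' will not match 'pipes' or 'piping', so stemming or explicit variants would be needed for better recall. Ambiguous terms such as 'Down' or 'Cork' would also need extra context to avoid false positives.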

In [8]:

# # URL = Uniform Resource Locator
# try:
#     # For Python 3.0 and later
#     from urllib.request import urlopen
# except ImportError:
#     # Fall back to Python 2's urllib2
#     from urllib2 import urlopen
    
# from bs4 import BeautifulSoup
# import re
# from geopy.geocoders import Nominatim
# import numpy as np

# sample urls
profile_url = 'https://thesession.org/members/97793'
tunebook_url = 'https://thesession.org/members/97793/tunebook'
usersettings_url = 'https://thesession.org/members/97793/tunes'
tune_url = 'https://thesession.org/tunes/55'
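
Reverse-geocoded addresses do not always use the same keys: depending on the place, the settlement name may appear under 'city', 'town', 'village' or 'locality', and the region under 'state' or 'state_district'. A minimal sketch of a fallback lookup is given below. The helper names (`first_present`, `summarize_address`) and the key priority order are my own assumptions, and the sample dict is a trimmed copy of an address returned for one of the profiles.

```python
# Assumed priority order for the keys that may hold the settlement name.
PLACE_KEYS = ("city", "town", "village", "locality")
REGION_KEYS = ("state", "state_district")


def first_present(address, keys):
    """Return the value of the first key found in the address dict, else None."""
    for key in keys:
        if key in address:
            return address[key]
    return None


def summarize_address(address):
    """Reduce a reverse-geocoded address dict to a fixed set of fields."""
    return {
        "place": first_present(address, PLACE_KEYS),
        "county": address.get("county"),
        "region": first_present(address, REGION_KEYS),
        "country_code": address.get("country_code"),
    }


# Trimmed example of an address dict as returned in location.raw['address'].
address = {"town": "Uster", "county": "Bezirk Uster",
           "state": "Zürich", "country_code": "ch"}
summary = summarize_address(address)
print(summary)
```

Returning `None` for missing fields avoids the `KeyError` that indexing a fixed key such as `['city']` would raise for profiles located in towns or villages.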

In [9]:
# Looking at different user profiles to see what sort of classes
# of locations might be useful to look for

from geopy.geocoders import Nominatim
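# Recent geopy versions require an explicit user_agent; the name used here is a placeholder.
geolocator = Nominatim(user_agent='thesession-scraper')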

# some locations have alternatives to city/county/state

up = userProfile(1659)
location = geolocator.reverse(up.location)
print(location.raw)

print(location.raw['address']['town'])
print(location.raw['address']['county'])
print(location.raw['address']['state'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])

    
up = userProfile(1659)
location = geolocator.reverse(up.location)
print(location.raw)

print(location.raw['address']['town'])
print(location.raw['address']['county'])
print(location.raw['address']['state'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])

up = userProfile(13449)
location = geolocator.reverse(up.location)
print(location.raw)

#print(location.raw['address']['city'])
print(location.raw['address']['locality'])
print(location.raw['address']['county'])
print(location.raw['address']['state_district'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])

up = userProfile(10078)
location = geolocator.reverse(up.location)
print(location.raw)

#print(location.raw['address']['city'])
print(location.raw['address']['village'])
print(location.raw['address']['county'])
print(location.raw['address']['state'])
print(location.raw['address']['country'])
print(location.raw['address']['country_code'])


https://thesession.org/members/1659
{u'display_name': u'3, Stationsstrasse, Niederuster, Uster, Bezirk Uster, Z\xfcrich, 8606, Schweiz/Suisse/Svizzera/Svizra', u'place_id': u'82094140', u'lon': u'8.6937239615674', u'boundingbox': [u'47.369756', u'47.3699778', u'8.6935723', u'8.6938819'], u'osm_type': u'way', u'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', u'osm_id': u'48839324', u'lat': u'47.3698669', u'address': {u'town': u'Uster', u'house_number': u'3', u'country': u'Schweiz/Suisse/Svizzera/Svizra', u'county': u'Bezirk Uster', u'suburb': u'Niederuster', u'state': u'Z\xfcrich', u'postcode': u'8606', u'country_code': u'ch', u'road': u'Stationsstrasse'}}
Uster
Bezirk Uster
Zürich
Schweiz/Suisse/Svizzera/Svizra
ch
https://thesession.org/members/1659
{u'display_name': u'3, Stationsstrasse, Niederuster, Uster, Bezirk Uster, Z\xfcrich, 8606, Schweiz/Suisse/Svizzera/Svizra', u'place_id': u'82094140', u'lon': u'8.6937239615674', u'boundingbox': [u'47.