# New York Philharmonic

## Due Thursday, May 12 at 8 AM

In this lab, you will analyze XML data of every one of the [New York Philharmonic](http://www.nyphil.org)'s concerts between 1963 and 1973. The data resides in the file `/data/nyphil.xml`.

Note that the same program may have been used for several concerts. So the number of times a work was _programmed_ might be different from the number of times it was _performed_.
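
The distinction can be illustrated on a tiny, made-up document. This is only a sketch under stated assumptions: the tag names `program`, `concertInfo`, `Date`, `work`, `composerName` and `workTitle` mirror the ones used later in this lab, but the exact nesting in `/data/nyphil.xml` is assumed, and the toy composers, titles and dates are invented. It uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup so it runs without extra packages; the same loops carry over to `find_all`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy data (invented): the first program was given three times, the second once.
TOY_XML = """
<programs>
  <program>
    <concertInfo><Date>1963-10-01</Date></concertInfo>
    <concertInfo><Date>1963-10-02</Date></concertInfo>
    <concertInfo><Date>1963-10-03</Date></concertInfo>
    <work ID="1*"><composerName>Composer A</composerName><workTitle>WORK X</workTitle></work>
  </program>
  <program>
    <concertInfo><Date>1964-01-15</Date></concertInfo>
    <work ID="1*"><composerName>Composer A</composerName><workTitle>WORK X</workTitle></work>
    <work ID="2*"><composerName>Composer B</composerName><workTitle>WORK Y</workTitle></work>
  </program>
</programs>
"""

root = ET.fromstring(TOY_XML)
programmed = Counter()  # work ID -> number of programs listing the work
performed = Counter()   # work ID -> number of concerts in which it was played

for program in root.iter("program"):
    n_concerts = sum(1 for _ in program.iter("concertInfo"))
    # A set makes a work count once per program even if it is listed twice.
    work_ids = {work.get("ID") for work in program.iter("work")}
    for work_id in work_ids:
        programmed[work_id] += 1
        performed[work_id] += n_concerts

print(programmed)
print(performed)
```

In this toy data, work `1*` sits on two programs but is played in four concerts, which is why the two questions below can have different answers.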

You are highly encouraged to skim the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Unlike most documentation, it's concise and organized!

## Question 0 (5 points)

Use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to read the XML data into a Python object. Store the object in a variable called `data`. Make sure the tests below run without any errors, as this question is autograded.

In [1]:
from bs4 import BeautifulSoup

# The "xml" parser requires lxml; the with-block closes the file after parsing.
with open("/data/nyphil.xml") as fp:
    data = BeautifulSoup(fp, "xml")

In [2]:
root = list(data.children)[0]
assert(root.name == "programs")
assert(sum(1 for _ in root.children) == 1931)

## Question 1 (15 points)

Which works (please give composer and title) were programmed the most times over this time period? (No explanation necessary; just print out the top works, alongside the counts of how many programs they appeared on.)

In [3]:
import pandas as pd

composer_tags = data.find_all("composerName")
worktitle_tags = data.find_all("workTitle")

composer_list_str = []
worktitle_list_str = []

for composer in composer_tags:
    
    composer_list_str.append(composer.string)
    
for worktitle in worktitle_tags:
    
    worktitle_list_str.append(worktitle.string)
    
works_list = list(zip(composer_list_str,worktitle_list_str))


works_dict = {}  # maps (composer name, work title) -> number of times the work was programmed

for work in works_list:
    
    if work not in works_dict:
        works_dict[work] = 1
    else:
        works_dict[work] += 1
    
num_times_ea_work_programmed_df = pd.Series(works_dict).to_frame()
num_times_ea_work_programmed_df.columns = ['Number of Times a Work Was Programmed']
num_times_ea_work_programmed_df = num_times_ea_work_programmed_df.reset_index()
num_times_ea_work_programmed_df = num_times_ea_work_programmed_df.sort_values(by = "Number of Times a Work Was Programmed", ascending = False)
num_times_ea_work_programmed_df = num_times_ea_work_programmed_df.rename(columns = {'index': "(Composer Name, Work Title)" }).reset_index() 
num_times_ea_work_programmed_df.head(10)  # .ix was removed in pandas 1.0

Unnamed: 0,index,"(Composer Name, Work Title)",Number of Times a Work Was Programmed
0,982,"(Gershwin, George, PORGY AND BESS)",106
1,559,"(Berlioz, Hector, DAMNATION DE FAUST, LA, OP....",42
2,1269,"(Prokofiev, Sergei, ROMEO AND JULIET: SUITE N...",41
3,111,"(Mahler, Gustav, KNABEN WUNDERHORN, DES (12 S...",35
4,106,"(Stravinsky, Igor, FIREBIRD: SUITE (1919 VERS...",33
5,378,"(Berlioz, Hector, SYMPHONIE FANTASTIQUE, OP.14)",33
6,301,"(Musorgsky, Modest, PICTURES AT AN EXHIBITION...",30
7,1007,"(Berlioz, Hector, ROMAN CARNIVAL OVERTURE (LE...",29
8,510,"(Mendelssohn, Felix, SYMPHONY NO. 4, A MAJOR,...",28
9,1073,"(Schuller, Gunther, SEVEN STUDIES ON THEMES O...",27
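
One fragile point when pairing composers with titles is alignment: two separate document-wide lists only line up if every work has both tags. A hedged sketch of an alternative is to read both fields from the same `work` element and count each pair at most once per program. The toy XML below is invented, the `interval` tag on the `0*` entry is an assumption about what a non-work entry looks like, and `xml.etree.ElementTree` stands in for Beautiful Soup so the example runs with the standard library alone.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy data (invented). The "0*" entry has no composer or title.
TOY_XML = """
<programs>
  <program>
    <work ID="1*"><composerName>Composer A</composerName><workTitle>WORK X</workTitle></work>
    <work ID="0*"><interval>Intermission</interval></work>
    <work ID="2*"><composerName>Composer B</composerName><workTitle>WORK Y</workTitle></work>
  </program>
  <program>
    <work ID="1*"><composerName>Composer A</composerName><workTitle>WORK X</workTitle></work>
  </program>
</programs>
"""

root = ET.fromstring(TOY_XML)
counts = Counter()  # (composer, title) -> number of programs

for program in root.iter("program"):
    seen = set()  # pairs already found on this program
    for work in program.iter("work"):
        composer = work.findtext("composerName")  # None when the tag is absent
        title = work.findtext("workTitle")
        if composer is None or title is None:
            continue  # skip entries that are missing either field
        seen.add((composer, title))
    counts.update(seen)

top = counts.most_common(2)
print(top)
```

With Beautiful Soup the same idea would use `work.find("composerName")`, which also returns `None` when the tag is missing.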


## Question 2 (20 points)

Which works (please give composer and title) were performed the most times over this time period? (No explanation necessary; just print out the top works, alongside the counts of how many times they were performed.)

Note: look at the `concertInfo` tags. Some programs have several `concertInfo` tags, one per concert, so each work on such a program was performed once for every `concertInfo` tag in that program.

In [4]:

program_tags = data.find_all("program")
works_dict2 = {}

for program in program_tags:
    # Each concertInfo tag records one performance of this program.
    num_performances_per_ID = len(program.find_all("concertInfo"))

    for work in program.find_all("work"):
        work_id = work['ID']
        if work_id == "0*":  # skip entries with the placeholder ID "0*"
            continue
        # Add this program's performances to the running total for the work.
        works_dict2[work_id] = works_dict2.get(work_id, 0) + num_performances_per_ID
            


num_performances_per_work = pd.Series(works_dict2).to_frame()

num_performances_per_work.columns = ['Number of Performances Per Work']
num_performances_per_work = num_performances_per_work.reset_index()
num_performances_per_work = num_performances_per_work.sort_values(by = 'Number of Performances Per Work', ascending = False)
num_performances_per_work = num_performances_per_work.rename(columns = {'index': "Work ID" })
num_performances_per_work = num_performances_per_work.reset_index()
num_performances_per_work.head(10)  # .ix was removed in pandas 1.0
# performance_count_per_program = [] #num concertinfo's per id, DONT PUT IT IN LIST OR WILL HAVE INDEXING COMPLEXITIES

# for program in program_tags:
    
#     concert_infos_list_per_ID = list(program.find_all("concertInfo"))
#     num_performances_per_ID = len(concert_infos_list_per_ID)
    
#     performance_count_per_program.append(num_performances_per_ID)
    
# performance_count_per_program


Unnamed: 0,index,Work ID,Number of Performances Per Work
0,1325,53624*,51
1,803,51370*,45
2,308,3395*,44
3,844,51581*,41
4,1160,52644*,41
5,937,51884*,38
6,1139,52577*,37
7,1754,715*,37
8,553,50064*,36
9,1109,52453*,35
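
The question asks for composer and title, while the table above is keyed by work ID. One possible way to attach readable labels, sketched here under assumptions, is to record a lookup from ID to (composer, title) while counting. The toy XML is invented, the names `labels` and `report` are hypothetical, and `xml.etree.ElementTree` is used in place of Beautiful Soup so the sketch needs only the standard library.

```python
import xml.etree.ElementTree as ET

# Toy data (invented): the first program has two concerts, the second has one.
TOY_XML = """
<programs>
  <program>
    <concertInfo><Date>1963-10-01</Date></concertInfo>
    <concertInfo><Date>1963-10-02</Date></concertInfo>
    <work ID="1*"><composerName>Composer A</composerName><workTitle>WORK X</workTitle></work>
    <work ID="0*"></work>
  </program>
  <program>
    <concertInfo><Date>1964-01-15</Date></concertInfo>
    <work ID="1*"><composerName>Composer A</composerName><workTitle>WORK X</workTitle></work>
    <work ID="2*"><composerName>Composer B</composerName><workTitle>WORK Y</workTitle></work>
  </program>
</programs>
"""

root = ET.fromstring(TOY_XML)
performances = {}  # work ID -> total number of performances
labels = {}        # work ID -> (composer, title), taken from the first occurrence

for program in root.iter("program"):
    n_concerts = len(program.findall(".//concertInfo"))
    for work in program.iter("work"):
        work_id = work.get("ID")
        if work_id == "0*":
            continue
        performances[work_id] = performances.get(work_id, 0) + n_concerts
        labels.setdefault(
            work_id, (work.findtext("composerName"), work.findtext("workTitle"))
        )

# Sort by performance count, largest first, and swap each ID for its label.
ranked = sorted(performances.items(), key=lambda item: item[1], reverse=True)
report = [(labels[work_id], count) for work_id, count in ranked]
print(report)
```

In pandas, the same labels could be added to the counts table as an extra column built from such a dictionary.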


## Question 3a (20 points)

Make a Pandas DataFrame, where each row is a work that was programmed by the New York Philharmonic. The columns should include the composer, work title, conductor, and the date of the first performance of that program. (Hint: You may want to look at the Pandas DataFrame documentation to remind yourself about how to use `pd.DataFrame` to create a DataFrame from a dict.)

Please print out the first few rows of your DataFrame.

In [5]:
program_tags = data.find_all("program")

# test_dict = {
    
#     'composer':[1,2],
#     'work title':[3,4],
#     'conductor':[5,6]
    
# }

# df3a_test = pd.DataFrame(test_dict)
# #keys of dict become columns(composer,worktitle, conductor...)
# #values of dict are lists that contain varying content of the column, such as workID's

# #each row has pieces of info that represent the work id

# #list in values from dict must contain all the composers or conducts , etc..should be global to for loop

# df3a_test

works_info_data = {}
composerNames_list = []
workTitles_list = []
conductorNames_list = []

# First, collect each field of every work into its own list:
for program in program_tags:
    
    #find smallest timestamp
#     strDates = []
#     dates = list(program.find_all("Date"))
    
#     for tag in dates:
        
#         strDates.append(tag.string)
    
#     earliest_date = min(dates)
#     print(dates)


    for work in program.find_all("work"):
        
        if work['ID'] != '0*':
            
            for work_detail in work.children:
                
                if work_detail.name == 'composerName':
                    composerNames_list.append(work_detail.string)
                elif work_detail.name == 'workTitle':
                    workTitles_list.append(work_detail.string)
                elif work_detail.name == 'conductorName':
                    conductorNames_list.append(work_detail.string)
                    
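input_dict = {}
# Pad the shorter conductor list with 'N/A' so that all columns have the same length
# (pd.DataFrame raises an error when the lists differ in length).
# Caveat: padding at the end does not realign conductors with the works that lack a
# conductorName tag, so rows after the first such work may be paired incorrectly.
num_missing = len(workTitles_list) - len(conductorNames_list)  # 198 for this data set
conductorNames_list.extend(['N/A'] * num_missing)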

input_dict['composerNames'] = composerNames_list
input_dict['workTitles'] = workTitles_list
input_dict['conductorNames'] = conductorNames_list

#can't create dataframe if arrays are all different sizes..fixed
df_3a = pd.DataFrame(input_dict)
df_3a

# print(len(composerNames_list)) #2 sizes smaller, must fill in rest to get same length
# print(len(workTitles_list))
# print(len(conductorNames_list))

Unnamed: 0,composerNames,conductorNames,workTitles
0,"Brahms, Johannes","Bernstein, Leonard","ACADEMIC FESTIVAL OVERTURE, OP.80"
1,"Brahms, Johannes","Bernstein, Leonard","SYMPHONY NO. 4 IN E MINOR, OP. 98"
2,"Brahms, Johannes","Bernstein, Leonard","CONCERTO, VIOLIN AND CELLO, OP. 102 (DOUBLE)"
3,"Mendelssohn, Felix","Bernstein, Leonard","SYMPHONY NO. 4, A MAJOR, OP. 90 (ITALIAN)"
4,"Stravinsky, Igor","Ozawa, Seiji",FIREBIRD: SUITE (1919 VERSION)
5,"Beethoven, Ludwig van","Bernstein, Leonard","SYMPHONY NO. 5 IN C MINOR, OP.67"
6,"Ravel, Maurice","Bernstein, Leonard",ALBORADA DEL GRACIOSO
...,...,...,...
4655,"Beethoven, Ludwig van",,"SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)"
4656,"Bernstein, Leonard",,DYBBUK VARIATIONS
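
An alternative that keeps every field of a work on the same row, and also picks up the date column the question asks for, is to build one dict per work and collect the dicts in a list. The following is a minimal sketch under assumptions: the toy XML and its ISO-style dates are invented, the column names are hypothetical, and `xml.etree.ElementTree` replaces Beautiful Soup so the block runs with the standard library only.

```python
import xml.etree.ElementTree as ET

# Toy data (invented): the second work has no conductorName tag.
TOY_XML = """
<programs>
  <program>
    <concertInfo><Date>1963-10-03</Date></concertInfo>
    <concertInfo><Date>1963-10-01</Date></concertInfo>
    <work ID="1*">
      <composerName>Composer A</composerName>
      <workTitle>WORK X</workTitle>
      <conductorName>Conductor P</conductorName>
    </work>
    <work ID="2*">
      <composerName>Composer B</composerName>
      <workTitle>WORK Y</workTitle>
    </work>
  </program>
</programs>
"""

root = ET.fromstring(TOY_XML)
rows = []  # one dict per programmed work

for program in root.iter("program"):
    # ISO-formatted date strings sort chronologically, so min() gives the earliest.
    dates = [date.text for date in program.iter("Date")]
    first_date = min(dates) if dates else None

    for work in program.iter("work"):
        if work.get("ID") == "0*":
            continue
        rows.append({
            "composer": work.findtext("composerName"),
            "title": work.findtext("workTitle"),
            "conductor": work.findtext("conductorName"),  # None when absent
            "first_date": first_date,
        })

print(rows)
```

Passing such a list to `pd.DataFrame(rows)` gives one row per work, with missing conductors left as missing values in the row they belong to rather than padded at the end.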


## Question 3b (10 points)

Use the DataFrame you created above to determine Leonard Bernstein's favorite composers. That is, which composers appeared on the most programs where Bernstein was conducting?

In [6]:
# Keep only the rows where Bernstein was conducting (.ix was removed in pandas 1.0).
df_3a_filterby_Bernstein = df_3a.loc[df_3a['conductorNames'] == "Bernstein, Leonard"]
df_3b = df_3a_filterby_Bernstein.groupby('composerNames')['composerNames'].count().to_frame()
df_3b.columns = ["Num Times Composer Appeared where Bernstein was conducting"]
df_3b = df_3b.sort_values(by = "Num Times Composer Appeared where Bernstein was conducting", ascending = False).reset_index()
df_3b.head(3)

Unnamed: 0,composerNames,Num Times Composer Appeared where Bernstein was conducting
0,"Beethoven, Ludwig van",51
1,"Mahler, Gustav",37
2,"Tchaikovsky, Pyotr Ilyich",33


By this count, Bernstein's top three favorite composers were Beethoven, Mahler, and Tchaikovsky.

## Question 4 (20 points)

For each composer, calculate the number of programs that featured one (or more) of his works.  Sort the composers in descending order of the number of programs in which they appeared.

**Think:** Why can't you just call `.groupby("composer").count()` on your Pandas DataFrame from the previous question?

In [7]:
# For each composer, scan the composerName tags of every program.
# Each matching tag adds one to that composer's count in the dict.

program_tags = data.find_all("program")

num_programs_featuring_composerswork_dict = {}

composerNames_noduplicates = list(set(composerNames_list))

for composer in composerNames_noduplicates:
    
    #print(composer)
    
    for program in program_tags:
        
        comp_names_inprogram = program.find_all('composerName')
        
        for comp_name_tag in comp_names_inprogram:
            
            if composer == comp_name_tag.string:
        
                if composer not in num_programs_featuring_composerswork_dict:
                    num_programs_featuring_composerswork_dict[composer] = 1
                else:
                    num_programs_featuring_composerswork_dict[composer] += 1



In [8]:
num_programs_featuring_composerswork_df = pd.Series(num_programs_featuring_composerswork_dict).to_frame().reset_index()
num_programs_featuring_composerswork_df.columns = ['composerNames', "number of programs that featured one (or more) of composer's works"]
num_programs_featuring_composerswork_df = num_programs_featuring_composerswork_df.sort_values(by = "number of programs that featured one (or more) of composer's works", ascending = False )
num_programs_featuring_composerswork_df = num_programs_featuring_composerswork_df.reset_index()
num_programs_featuring_composerswork_df

Unnamed: 0,index,composerNames,number of programs that featured one (or more) of composer's works
0,16,"Beethoven, Ludwig van",228
1,208,"Mozart, Wolfgang Amadeus",217
2,308,"Tchaikovsky, Pyotr Ilyich",214
3,107,"Gershwin, George",164
4,301,"Stravinsky, Igor",158
5,22,"Berlioz, Hector",155
6,36,"Brahms, Johannes",124
...,...,...,...
334,235,"Quilter, Roger",1
335,127,"Guarnieri, Camargo",1
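
The "Think" prompt in this question turns on the difference between counting works and counting programs: a composer with two works on one program contributes two rows but only one program. A small sketch with invented (program, composer) rows shows the gap; the program labels and composer names are hypothetical.

```python
from collections import Counter

# Hypothetical rows, one per programmed work: (program, composer).
rows = [
    ("program 1", "Composer A"),
    ("program 1", "Composer A"),  # second work by the same composer, same program
    ("program 1", "Composer B"),
    ("program 2", "Composer A"),
]

# Counting rows counts works, so Composer A is counted three times.
works_per_composer = Counter(composer for _, composer in rows)

# Removing duplicate (program, composer) pairs first counts programs instead.
programs_per_composer = Counter(composer for _, composer in set(rows))

print(works_per_composer)
print(programs_per_composer)
```

In pandas, the same effect could be obtained by calling `drop_duplicates` on the program and composer columns before grouping, which requires the DataFrame to carry a program identifier.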
