
# Working with data files

For this project, I use data from the file `ca-GrQc.txt`, which is located in the data folder of the repository. The file contains the co-authorship links for articles in the arXiv category General Relativity. Each line in the file holds the ids of two authors who have worked together on at least one article. In network analysis parlance, this is known as an "edge list." The data are obtained from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/index.html).

### Getting all coauthorships in a list of lists

Creating a list that contains every edge in the data file, where each edge is a list of the two authors' ids stored as integers.


In [1]:
# Open file and read lines as a list of strings
with open('../data/ca-GrQc.txt', 'r') as file:
    lines = file.readlines()

# Create an empty list
a_list = []

# Iterate over each line, skipping the four header lines at the top of the file
for elem in lines[4:]:

    # Strip the newline and split the tab-separated line into the two author ids
    authors = elem.strip().split('\t')

    # Convert the id strings to integers and add the pair to a_list
    a_list.append([int(authors[0]), int(authors[1])])

print(a_list[:10])


[[3466, 937], [3466, 5233], [3466, 8579], [3466, 10310], [3466, 15931], [3466, 17038], [3466, 18720], [3466, 19607], [10310, 1854], [10310, 3466]]
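
The slice `lines[4:]` works as long as the header is exactly four lines long. As a minimal sketch of a more defensive variant, the helper below skips any line that starts with `#` instead of counting lines. The function name `read_edge_list` and the sample text are my own illustrations, not part of the dataset, and the sketch assumes that header lines begin with `#`.

```python
import io

def read_edge_list(handle):
    """Parse whitespace-separated id pairs, skipping blank and '#' lines."""
    edges = []
    for line in handle:
        line = line.strip()
        # Assumption: header/comment lines start with '#'
        if not line or line.startswith('#'):
            continue
        first, second = line.split()[:2]
        edges.append([int(first), int(second)])
    return edges

# Made-up sample that mimics the file layout; not real data
sample = io.StringIO("# a comment line\n# FromNodeId\tToNodeId\n1\t2\n2\t1\n3\t3\n")
sample_edges = read_edge_list(sample)
print(sample_edges)  # [[1, 2], [2, 1], [3, 3]]
```

With the real file, the same helper could be called as `read_edge_list(file)` inside the `with open(...)` block.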


### Who are the authors in the data?

- Creating a sorted list with the integer ids for all of the unique authors in the dataset.
- Using a dictionary comprehension, creating a dictionary in which the keys are the author integer ids and the values are empty lists.

In [2]:
# Use a list comprehension to flatten a_list into one list of author ids
in_edge_list = [in_edge for edge in a_list for in_edge in edge]

# Convert in_edge_list to a set to drop duplicates, then sort it into a list
# A set contains unique elements but has no guaranteed order and can't be sliced
unique_list = sorted(set(in_edge_list))

print(unique_list[:10])
print(len(unique_list))

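# Use a dictionary comprehension to create a dictionary
# where each author's id is the key and [] is the value
# (the loop variable is author_id to avoid shadowing the built-in id)
dic = {author_id: [] for author_id in unique_list}

# To confirm, I built the same dictionary from the first 10 ids only
test_dic = {author_id: [] for author_id in unique_list[:10]}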
print(test_dic)

[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]
5242
{13: [], 14: [], 22: [], 24: [], 25: [], 26: [], 27: [], 28: [], 29: [], 45: []}
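
To make the two steps above concrete, here is a small sketch on made-up edges (the ids in `toy_edges` are invented for illustration). It also shows why a dictionary comprehension is a reasonable choice here: each key gets its own empty list, whereas `dict.fromkeys(ids, [])` would share a single list among all keys.

```python
# Made-up edges for illustration; not taken from ca-GrQc.txt
toy_edges = [[30, 7], [7, 30], [12, 30], [30, 12], [7, 7]]

# A set comprehension removes duplicates; sorted() returns an ordered list
toy_ids = sorted({author for edge in toy_edges for author in edge})
print(toy_ids)  # [7, 12, 30]

# The comprehension evaluates [] once per key, so the lists are independent
toy_dic = {author: [] for author in toy_ids}
toy_dic[7].append(30)
print(toy_dic[12])  # []

# dict.fromkeys reuses the same list object for every key
shared = dict.fromkeys(toy_ids, [])
shared[7].append(99)
print(shared[12])  # [99]
```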


### Getting each author's coauthors

Note: The data contain errors. For example, they list author 13 as a coauthor of themselves, which is meaningless. To get the maximum number of points, my code excludes each author from their own list of coauthors.

In [3]:
# Use a for loop to iterate over the edges in a_list
for edge in a_list:

    # Skip self-loops, where both ids in the edge are the same
    if edge[0] != edge[1]:

        # Append the second author to the first author's list in dic
        dic[edge[0]].append(edge[1])

# Create separate lists for the keys and values in dic
key_list = list(dic.keys())
values_list = list(dic.values())

# Use a list comprehension to collect the coauthor ids of the first 10 authors
target_list = [iden for value in values_list[:10] for iden in value]

# Convert the list into a set to keep only unique ids,
# then to a list again
print(list(set(target_list)))


[25346, 21508, 773, 12679, 2952, 19081, 20108, 3858, 20243, 7956, 21012, 24726, 11801, 15003, 17692, 20635, 15774, 4511, 18719, 4513, 1186, 6179, 2212, 21281, 22691, 25758, 13096, 15659, 7596, 3372, 6830, 11183, 8879, 15793, 18866, 12851, 2741, 9785, 570, 11196, 19517, 25540, 4164, 4550, 14540, 12365, 18894, 19961, 11472, 12496, 6610, 25043, 4180, 20562, 13142, 14807, 21847, 22618, 14171, 19423, 19170, 22887, 11241, 106, 22891, 11114, 7916, 12781, 19440, 1653, 17655, 23161, 24955, 23293, 1407]
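
As an alternative sketch, not the approach used in the cell above, coauthors can be stored in sets so that a repeated pair is counted only once. The helper name `build_coauthors` and the toy edges are hypothetical. Unlike my code, this version adds both directions itself, so it does not rely on the file listing each pair twice.

```python
from collections import defaultdict

def build_coauthors(edges):
    """Map each author id to the set of distinct coauthor ids."""
    coauthors = defaultdict(set)
    for first, second in edges:
        # Touch both keys so authors with only a self-loop still appear
        coauthors[first]
        coauthors[second]
        # Exclude self-loops, as in the note above
        if first != second:
            coauthors[first].add(second)
            coauthors[second].add(first)
    return dict(coauthors)

# Made-up edges: a repeated pair, a one-directional pair, and a self-loop
toy = [[1, 2], [2, 1], [2, 3], [4, 4]]
toy_coauthors = build_coauthors(toy)
print(toy_coauthors)
```

Sets trade away insertion order for automatic de-duplication, which suits a question like "how many distinct coauthors does each author have?"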


### Finding the author who has the most coauthors

In [4]:

# Create a list of the lengths of the values in dic
len_list = [len(value) for value in dic.values()]

# Compute the maximum once instead of on every loop iteration
max_len = max(len_list)
print(max_len)

# Iterate through the keys in dic
for key in dic:
    # Print each key whose list of coauthors has the maximum length
    if len(dic[key]) == max_len:
        print(key)

print('Author 21012 has 81 coauthors, the highest number of coauthors.')

81
21012
Author 21012 has 81 coauthors, the highest number of coauthors.
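
The same search can be wrapped in a small function that returns the result instead of printing it and reports every author tied for first place. This is a sketch under the assumption that the dictionary is non-empty (`max` raises `ValueError` on an empty sequence). The name `most_connected` and the toy dictionary are hypothetical.

```python
def most_connected(coauthors):
    """Return (highest coauthor count, sorted ids of authors with that count)."""
    top = max(len(names) for names in coauthors.values())
    winners = sorted(author for author, names in coauthors.items()
                     if len(names) == top)
    return top, winners

# Made-up dictionary in the same shape as dic; authors 1 and 3 are tied
toy = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
top_count, top_authors = most_connected(toy)
print(top_count, top_authors)  # 2 [1, 3]
```

Calling `most_connected(dic)` on the dictionary built earlier should reproduce the count and author id printed above.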
