# Amazon Co-purchase Network
This notebook contains the code for extracting and cleaning the Amazon data.
##### Note: The data file locations given here are local to our machine. We have submitted the files separately. To execute this notebook, copy those files and provide their locations in the required places.

In [39]:
# All library imports
import networkx as nx
import csv
from operator import itemgetter
import community #This is the python-louvain package we installed.

## Data Extraction
First, the selected attributes are extracted from the metadata. Then the edge list is created using a conversion function.
The steps are shown here for the March data and have to be repeated for the remaining months.

In [40]:
# This function extracts the list of attributes for each product
# from the metadata that we will use for our analysis
def extract_product_attributes_nodes(in_file,out_file):
    #Read all lines of the meta data into content list.
    fname = in_file #input metadata .txt file
    with open(fname, encoding = 'utf8') as f:
        content = f.readlines()
    #Remove the beginning and trailing white spaces.
    content = [x.strip() for x in content] 

    # Write the extracted information to out_file as a ','-delimited file.
    # The columns are Id, title, group, categories, totalreviews, avgrating.
    # The code stores all extracted information about a product in previouslines
    # and writes it to the file only when all fields are available. Hence,
    # if review information for a product is not available, the product won't appear
    # in the final file.

    file = open(out_file,"w", encoding='utf8') # output file with extracted attributes
    previouslines = ['Id', 'title', 'group', 'categories', 'totalreviews', 'avgrating']
    for line in content:
        lines = line.split(':')
        if lines[0] == "Id":
            if (len(previouslines) == 6):
                for component in previouslines[0:5]:
                    file.write(component)
                    file.write(',')
                file.write(previouslines[5])
                file.write("\n")
            previouslines = []
            previouslines.append(lines[1].strip())

        if lines[0] == "title":
            title = ':'.join(lines[1:]).strip().replace(',', ' ').replace('\n', ' ').strip()
            previouslines.append(title)

        if lines[0] == "group":
            previouslines.append(lines[1].strip())

        if lines[0] == "categories":
            previouslines.append(lines[1].strip())

        if lines[0] == "reviews" and lines[1].strip() == "total":
            previouslines.append(lines[2].split(' ')[1])
            previouslines.append(lines[4].strip())
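    # Flush the last product: no further "Id" line follows it to trigger the write above.
    if (len(previouslines) == 6):
        file.write(','.join(previouslines))
        file.write("\n")
    file.close()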
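
To make the parsing rules above concrete, here is a minimal, self-contained sketch that applies the same colon-splitting logic to a single in-memory record. The sample record and the helper name `parse_record` are hypothetical and only illustrate the field layout the function expects; this is not the function used in the rest of the notebook.

```python
# Hypothetical sample record; the keys mirror those parsed by the function above.
sample = """\
Id:   1
ASIN: 0000000001
  title: Example Book: A Subtitle, With Comma
  group: Book
  salesrank: 12345
  categories: 2
  reviews: total: 2  downloaded: 2  avg rating: 4.5
"""

def parse_record(text):
    """Return [Id, title, group, categories, totalreviews, avgrating], or None if incomplete."""
    fields = []
    for raw in text.splitlines():
        parts = raw.strip().split(':')
        key = parts[0]
        if key == "Id":
            fields = [parts[1].strip()]
        elif key == "title":
            # Re-join on ':' so colons inside the title survive; commas are replaced
            # because the output file is comma-delimited.
            fields.append(':'.join(parts[1:]).strip().replace(',', ' '))
        elif key in ("group", "categories"):
            fields.append(parts[1].strip())
        elif key == "reviews" and parts[1].strip() == "total":
            fields.append(parts[2].split()[0])  # total number of reviews
            fields.append(parts[4].strip())     # average rating
    return fields if len(fields) == 6 else None

row = parse_record(sample)
print(','.join(row))
```

A record that lacks any of the six fields (for example, one with no review line) yields `None`, which mirrors how incomplete products are left out of the output file.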

#### Converting the text file to a .csv file to create the lists of nodes and edges for the graph

In [1]:
# This function converts the text file to .CSV file for creation of list of nodes and edges for the graph
def convert_txt_csv(i_file,o_file,final_csv):
    with open(i_file, 'r', encoding='utf8', newline='') as in_file:
        stripped = (line.strip() for line in in_file)
        lines = (line.split(",") for line in stripped if line)
        with open(o_file, 'w', encoding='utf8',newline='') as out_file:
            writer = csv.writer(out_file)
            writer.writerows(lines)
    # read tab-delimited file
    with open(o_file,'rt', encoding='utf8',newline='') as fin:
        cr = csv.reader(fin, delimiter='\t')
        filecontents = [line for line in cr]

    # write comma-delimited file (comma is the default delimiter)
    with open(final_csv,'w', encoding='utf8',newline='') as fou:
        # quotechar=None disables quoting; an empty string is rejected by Python 3.11 and later
        cw = csv.writer(fou, quotechar=None, quoting=csv.QUOTE_NONE, escapechar='\\')
        cw.writerows(filecontents)
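
The core of the conversion is re-reading tab-delimited rows and writing them back out comma-delimited. The sketch below shows that step in memory with `io.StringIO`, so it runs without any files. The sample edge list, including its header row, is an assumption made for illustration.

```python
import csv
import io

# Hypothetical tab-delimited edge list with one header row.
tsv_text = "FromNodeId\tToNodeId\n1\t2\n1\t4\n2\t3\n"

# Read the rows using a tab as the delimiter.
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')

# Write the same rows back out; comma is the default delimiter of csv.writer.
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerows(reader)

csv_text = out.getvalue()
print(csv_text)
```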
    

In [42]:
# Invoke above defined functions
extract_product_attributes_nodes('C:/Sushmita/MS DataScience/NetworkScience/Project/Data/amazon-meta/amazon-meta.txt',
                                 'C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/testfile.txt')


In [43]:
# Forms the edgelist for March
convert_txt_csv('C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/edgelist.txt',
                'C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/edgelist.tsv',
                'C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/edgelist.csv')

## Creating Graph
Using the node list and edge list from the cells above, a graph has to be created for each month. Again, we have done this only for March here.

In [46]:
# This function creates the graph based on the nodelist and edgelist created by above functions
def create_graph(node_file,edge_file):
    group_dict = {}
    names_dict ={}
    review_dict = {}
    rating_dict ={}
    G = nx.Graph()
    with open(node_file, 'r',  encoding="utf8",newline='') as nodecsv: # Open the file                       
        nodereader = csv.reader(nodecsv) # Read the csv  
        # Retrieve the data (a list comprehension plus list slicing removes the header row)
        nodes = [n for n in nodereader][1:]  
    node_names = [n[0] for n in nodes]
    with open(edge_file, 'r',  encoding="utf8",newline='') as edgecsv: # Open the file
        edgereader = csv.reader(edgecsv) # Read the csv     
        edges = [tuple(e) for e in edgereader][1:] # Retrieve the data
        
    for node in nodes: # Loop through the list, one row at a time
        names_dict[node[0]] = node[1]
        group_dict[node[0]] = node[2]
        review_dict[node[0]] = node[4]
        rating_dict[node[0]] = node[5]
        
    
    G.add_nodes_from(node_names)
    G.add_edges_from(edges)
    nx.set_node_attributes(G, names_dict, 'Title')
    nx.set_node_attributes(G, group_dict, 'Group')
    nx.set_node_attributes(G, review_dict, 'Reviews')
    nx.set_node_attributes(G, rating_dict, 'AvgRating')
    return G

In [47]:
G_March = create_graph('C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/testfile.txt',
                    'C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/edgelist.csv')
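
As a quick check of how the node attributes end up on the graph, the following sketch builds a tiny graph from in-memory rows instead of files. The three products and the single edge are hypothetical. The rating is converted to `float` here as an illustrative choice; `create_graph` above keeps the values as the strings read from the CSV.

```python
import networkx as nx

# Hypothetical rows in the column order Id, title, group, categories, totalreviews, avgrating.
nodes = [
    ["1", "Book A", "Book", "2", "10", "4.5"],
    ["2", "Album B", "Music", "1", "3", "4"],
    ["3", "Film C", "DVD", "4", "0", "0"],
]
edges = [("1", "2")]  # hypothetical co-purchase link

G = nx.Graph()
G.add_nodes_from(row[0] for row in nodes)
G.add_edges_from(edges)

# Attach attributes with dictionaries keyed by node id (NetworkX 2.x argument order).
nx.set_node_attributes(G, {row[0]: row[1] for row in nodes}, 'Title')
nx.set_node_attributes(G, {row[0]: float(row[5]) for row in nodes}, 'AvgRating')

print(G.nodes["1"])
print(G.number_of_nodes(), G.number_of_edges())
```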

## Data Cleaning
For better visualization, isolates have been removed and the largest connected component has been extracted. The cells below show this for March; the same steps have to be repeated for the other months.

In [48]:
remove_list=list(nx.isolates(G_March))
G_March.remove_nodes_from(remove_list)

In [49]:
import linkcom

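# nx.connected_component_subgraphs was removed in NetworkX 2.4; build the subgraph from the largest node set instead
large_component = G_March.subgraph(max(nx.connected_components(G_March), key=len)).copy()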

# Save the graph to file
nx.write_gml(large_component, "C:/Sushmita/MS DataScience/NetworkScience/Project/Results/graph/Amazon_March/march_large.gml")
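
The two cleaning steps can be tried on a toy graph to see their effect. This is a minimal sketch with made-up node names; it uses `nx.connected_components` together with `Graph.subgraph`, which is available across NetworkX 2.x releases.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("x", "y")])
G.add_node("lonely")  # an isolate: a node with no edges

# Step 1: drop isolates (materialize the list before mutating the graph).
G.remove_nodes_from(list(nx.isolates(G)))

# Step 2: keep only the largest connected component as an independent copy.
largest_nodes = max(nx.connected_components(G), key=len)
largest = G.subgraph(largest_nodes).copy()

print(sorted(largest.nodes()))
```

Calling `.copy()` matters when the component is modified or saved later, since `subgraph` alone returns a read-only view of the original graph.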