## [RQ5] The most influential users are the ones with the highest number of "followers"; you can now look more into their activity.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now we import only the columns of interest from the data set "instagram_posts.csv".\
Then we import the other two data sets, "instagram_locations.csv" and "instagram_profiles.csv".\
Finally, we fill the NA values in the columns we use, since missing values can cause problems in later operations.
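
As a minimal sketch of the two steps described above (column selection at load time and per-column NA filling), the example below uses a tiny in-memory table. The data, the `posts_demo` name and the reduced column list are made up for illustration and are not part of the real data set.

```python
import io
import pandas as pd

# Hypothetical miniature of a tab-separated posts file (not the real data).
raw = (
    "sid\tpost_id\tprofile_id\tlocation_id\tdescription\tnumbr_likes\n"
    "1\ta1\t10\t\tsome text\t5\n"
    "2\ta2\t\t77\tmore text\t\n"
)

ignore_col = {"sid", "description"}

# A callable passed to usecols drops unwanted columns while the file is parsed.
posts_demo = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",
    usecols=lambda name: name not in ignore_col,
)

# A dict passed to fillna fills each listed column with its own default value.
posts_demo = posts_demo.fillna({"profile_id": 0, "location_id": 0, "numbr_likes": 0})
print(posts_demo)
```

Passing a callable to `usecols` avoids reading the header in a separate step, and a single `fillna` call on the DataFrame keeps all the defaults in one place.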

In [None]:
# Read only the first row to get the column names.
cols=pd.read_csv("instagram_posts.csv",delimiter="\t",nrows=1).columns.values.tolist()
# Columns that are not needed for this analysis.
ignore_col=("description","sid","cts","id")
posts = pd.read_csv("instagram_posts.csv",delimiter="\t",parse_dates=[5],infer_datetime_format=True, usecols=[x for x in cols if x not in ignore_col])

In [None]:
locations = pd.read_csv("instagram_locations.csv",delimiter="\t")
profiles = pd.read_csv("instagram_profiles.csv",delimiter="\t")

In [None]:
# Fill on the DataFrame itself: an inplace fill on a column accessed as an attribute
# may not propagate back to the DataFrame in recent pandas versions.
posts.fillna({"profile_id": 0, "location_id": 0, "numbr_likes": 0, "number_comments": 0}, inplace=True)

In [None]:
# Per-column defaults for the missing values of the profiles table.
profiles.fillna({
    "following": 0,
    "followers": 0,
    "n_posts": 0,
    "description": "",
    "firstname_lastname": "",
    "url": "",
    "cts": "",
    "is_business_account": False,
    "profile_id": 0,
}, inplace=True)

In [None]:
# Ids default to 0, the city to a readable placeholder, every other text field to ''.
text_cols = ['name', 'street', 'region', 'cd', 'phone', 'dir_city_name', 'dir_city_slug',
             'dir_country_id', 'dir_country_name', 'website', 'primary_alias_on_fb']
fill_values = {'sid': 0, 'id': 0, 'city': 'unknown location', **{c: '' for c in text_cols}}
locations.fillna(fill_values, inplace=True)

### 5.1 Plot the top 10 most popular users in terms of followers and their number of posts.

In [None]:
profiles_foll = profiles.sort_values(by='followers',ascending=False)
display(profiles_foll.head(10))

In [None]:
plt.plot(profiles_foll.followers,profiles_foll.n_posts)
plt.xlabel('Followers')
plt.ylabel('Number of posts')
plt.show()

In [None]:
plt.plot(profiles_foll.head(10).followers,profiles_foll.head(10).n_posts)
plt.xlabel('Followers')
plt.ylabel('Number of posts')
plt.show()

Contrary to what we expected, a higher number of followers does not imply a higher number of posts.\
The highest post counts belong to accounts with fewer followers.\
This may indicate that top influencers prefer to publish less often but with higher quality.\
We should still take into account that the spikes in the number of posts may be caused by bots.
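
One way to check this impression numerically is a rank correlation between followers and posts. The sketch below uses a made-up `profiles_demo` table, and the `posts_per_follower` column is an illustrative helper rather than part of the data set. With the real `profiles` table the figures will of course differ.

```python
import pandas as pd

# Hypothetical toy profiles; the notebook itself would use the `profiles` DataFrame.
profiles_demo = pd.DataFrame({
    "followers": [900, 500, 300, 100, 50],
    "n_posts": [20, 400, 35, 60, 10],
})

# The Pearson correlation of the ranks is Spearman's rank correlation.
rank_corr = profiles_demo["followers"].rank().corr(profiles_demo["n_posts"].rank())

# Posts per follower highlights accounts that post a lot relative to their audience.
profiles_demo["posts_per_follower"] = profiles_demo["n_posts"] / profiles_demo["followers"]

print(rank_corr)
print(profiles_demo.sort_values("posts_per_follower", ascending=False))
```

A rank correlation close to 0 would support the reading that follower count and posting volume are only weakly related, while accounts with an unusually high posts-per-follower value are natural candidates for a closer look (for example, as possible bots).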

### 5.2 Who is the most influential user?

In [None]:
display(profiles_foll.head(1))

### 5.3 Have they posted anything with tagged locations? Extract the most frequent areas on their posts and plot the number of times each city has been visited.

As in [RQ4], we define a function that returns all the posts of the top n profiles in terms of followers.

In [None]:
def top_n_users_loc(n):
    # Ids of the n profiles with the most followers (works even if fewer than n profiles exist).
    top_n_profiles = profiles.sort_values(by=['followers'], ascending = False).head(n)
    index = set(top_n_profiles["profile_id"])
    # Keep only the posts published by those profiles.
    new_dataset = posts[posts["profile_id"].isin(index)]
    return new_dataset.sort_values(by="profile_id",ascending=False)

Now that we have all the posts of the most influential profiles (i.e. those with the most followers), we can derive the location ids from each individual post.

In [None]:
locations_top = top_n_users_loc(10)["location_id"].tolist()
locations_top[:] = [x for x in locations_top if x != 0.0]

Next we count how many posts overall (by any user) have been tagged with the locations used by these top profiles.

In [None]:
position_count = posts[posts['location_id'].isin(locations_top)].groupby('location_id').count().reset_index()
position_count = position_count[['location_id', 'post_id']]
position_count.columns = ['location_id', 'counts']
number = position_count.counts.tolist()

In [None]:
cities_in_order = []
for loc_id in position_count["location_id"]:
    match = locations.loc[locations['id']==loc_id, 'city']
    # Keep the text before the first comma; use a placeholder when the id is not in `locations`.
    city = str(match.iloc[0]).split(',')[0] if len(match) else 'unknown location'
    cities_in_order.append(city)
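
The same lookup can also be sketched as a single join instead of a loop. The tables below are made-up stand-ins for `position_count` and `locations`, and the `_demo` names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for `position_count` and `locations`.
position_count_demo = pd.DataFrame({"location_id": [1, 2, 3], "counts": [7, 2, 4]})
locations_demo = pd.DataFrame({
    "id": [1, 2],
    "city": ["Rome, Italy", "Paris"],
})

# A left join keeps every counted location, even one missing from the lookup table.
merged = position_count_demo.merge(
    locations_demo[["id", "city"]].drop_duplicates("id"),
    left_on="location_id",
    right_on="id",
    how="left",
)

# Keep only the part before the first comma and label unmatched ids explicitly.
merged["city"] = merged["city"].str.split(",").str[0].fillna("unknown location")
cities_demo = merged["city"].tolist()
print(cities_demo)
```

A join scans the locations table once rather than once per location id, which matters when many locations are involved.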

In [None]:
y = number
x = cities_in_order
fig, ax = plt.subplots()
ax.scatter(x, y)

for i, txt in enumerate(x):
    ax.annotate(txt, (x[i], y[i]))

Since there are too many cities for this plot to be readable, we also list the cities in descending order of post count.

In [None]:

ordered_cities_for_visit= (list(zip(cities_in_order,number)))
ordered_cities_for_visit.sort(key=lambda x:x[1])
ordered_cities_for_visit.reverse()
ordered_cities_for_visit
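
Several location ids can belong to the same city, so the list above may contain a city more than once. A possible refinement, sketched here on made-up values standing in for `cities_in_order` and `number`, is to sum the counts per city before ranking.

```python
import pandas as pd

# Hypothetical values standing in for `cities_in_order` and `number`.
cities_demo = ["Rome", "Paris", "Rome", "unknown location"]
number_demo = [7, 2, 4, 9]

# Sum the counts of all location ids that map to the same city, then rank the cities.
city_counts = (
    pd.Series(number_demo, index=cities_demo)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)
print(city_counts.head(10))
```

The resulting Series can be plotted directly, for instance with `city_counts.head(10).plot.barh()`, which is usually easier to read than an annotated scatter plot.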

### 5.4 How many pictures-only posts have they published? How many reels? (only videos) and how many with both contents? Provide the number as percentages and interpret those figures.

In [None]:
top = top_n_users_loc(10)
# Share of each post type among the posts of the top 10 profiles, expressed as a percentage.
perc_photos = 100*top[top["post_type"]==1].shape[0]/top.shape[0]
perc_only_videos = 100*top[top["post_type"]==2].shape[0]/top.shape[0]
perc_multy = 100*top[top["post_type"]==3].shape[0]/top.shape[0]

In [None]:
print(f"The percentage of posts that include only photos in the top users is: {perc_photos:.2f}%")
print(f"The percentage of posts that include only videos in the top users is: {perc_only_videos:.2f}%")
print(f"The percentage of posts that include both photos and videos in the top users is: {perc_multy:.2f}%")

In [None]:
y = np.array([perc_photos,perc_only_videos,perc_multy])
mylabels = ["Only photos", "Only videos","Photos and videos"]
plt.pie(y,labels = mylabels,autopct='%1.2f%%')
plt.show()
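
The three shares can also be obtained in one step with `value_counts`. The sketch below uses a made-up `top_demo` table and follows the notebook's reading of `post_type` (1 photos, 2 videos, 3 both); the `labels` mapping is an illustrative helper.

```python
import pandas as pd

# Hypothetical posts; post_type 1, 2, 3 is read as photo, video, mixed, as in the cells above.
top_demo = pd.DataFrame({"post_type": [1, 1, 1, 2, 3, 1, 2, 1]})

labels = {1: "Only photos", 2: "Only videos", 3: "Photos and videos"}

# normalize=True returns shares that sum to 1; multiply by 100 for percentages.
shares = top_demo["post_type"].map(labels).value_counts(normalize=True) * 100
print(shares.round(2))
```

Because the shares are computed from a single column, they add up to 100 by construction, which is a quick sanity check on the three separate ratios.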

As we can see, the great majority of the posts published by the most followed accounts contain only photos.\
The data may have been collected before the rise of TikTok, which increased the publication of video-only posts.\
Photos may garner more engagement on platforms where users scroll through media quickly,\
while videos may be more successful on platforms where users actively seek out specific or detailed content.\
So it makes sense that users on Instagram prefer to publish photo-only posts.


### 5.5 How many "likes" and comments did posts with only pictures receive? How about videos and mixed posts? Try to provide the average numbers and confront them with their followers amount, explaining what you can say from that comparison.

We convert the "numbr_likes" and "number_comments" columns to NumPy arrays before computing their means.

In [None]:
list_likes_1 = top[top["post_type"]==1]["numbr_likes"].tolist()
list_likes_1 = np.array(list_likes_1)
mean_likes_1 = np.mean(list_likes_1.astype(float))
list_comm_1 = top[top["post_type"]==1]["number_comments"].tolist()
list_comm_1 = np.array(list_comm_1)
mean_comm_1 = np.mean(list_comm_1.astype(float))
print("Mean number of likes on photo-only posts of the top 10 profiles (by followers):",mean_likes_1)
print("Mean number of comments on photo-only posts of the top 10 profiles (by followers):",mean_comm_1)

In [None]:
list_likes_2 = top[top["post_type"]==2]["numbr_likes"].tolist()
list_likes_2 = np.array(list_likes_2)
mean_likes_2 = np.mean(list_likes_2.astype(float))
list_comm_2 = top[top["post_type"]==2]["number_comments"].tolist()
list_comm_2 = np.array(list_comm_2)
mean_comm_2 = np.mean(list_comm_2.astype(float))
print("Mean number of likes on video-only posts of the top 10 profiles (by followers):",mean_likes_2)
print("Mean number of comments on video-only posts of the top 10 profiles (by followers):",mean_comm_2)

In [None]:
list_likes_3 = top[top["post_type"]==3]["numbr_likes"].tolist()
list_likes_3 = np.array(list_likes_3)
# Guard against an empty array: the top profiles may have no mixed posts.
if len(list_likes_3) != 0:
    mean_likes_3 = np.mean(list_likes_3.astype(float))
else:
    mean_likes_3=0
list_comm_3 = top[top["post_type"]==3]["number_comments"].tolist()
list_comm_3 = np.array(list_comm_3)
if len(list_comm_3) != 0:
    mean_comm_3 = np.mean(list_comm_3.astype(float))
else:
    mean_comm_3=0
print("Mean number of likes on posts with both photos and videos of the top 10 profiles (by followers):",mean_likes_3)
print("Mean number of comments on posts with both photos and videos of the top 10 profiles (by followers):",mean_comm_3)
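
The three cells above repeat the same computation for each post type. A more compact variant, sketched on a made-up `top_demo` table, computes all the means with a single `groupby`.

```python
import pandas as pd

# Hypothetical posts of the top profiles (no mixed posts, to show the fallback).
top_demo = pd.DataFrame({
    "post_type": [1, 1, 2, 2, 1],
    "numbr_likes": [100, 300, 50, 150, 200],
    "number_comments": [10, 30, 5, 15, 20],
})

# One groupby yields the mean likes and comments for every post type present.
means = top_demo.groupby("post_type")[["numbr_likes", "number_comments"]].mean()

# reindex adds the post types with no posts; their means are set to 0,
# mirroring the fallback used for mixed posts above.
means = means.reindex([1, 2, 3]).fillna(0)
print(means)
```

Individual values can then be read with, for example, `means.loc[1, "numbr_likes"]`.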

Now we compute the mean number of followers of the top 10 profiles and compare it with the average likes and comments obtained above.

In [None]:
top_n_profiles = profiles.sort_values(by=['followers'], ascending = False).head(10)
mean_followers = top_n_profiles["followers"].mean()

In [None]:
print("Ratio of mean likes on photo-only posts to mean followers:",mean_likes_1/mean_followers)
print("Ratio of mean likes on video-only posts to mean followers:",mean_likes_2/mean_followers)
print("Ratio of mean likes on posts with both photos and videos to mean followers:",mean_likes_3/mean_followers)
print("Ratio of mean comments on photo-only posts to mean followers:",mean_comm_1/mean_followers)
print("Ratio of mean comments on video-only posts to mean followers:",mean_comm_2/mean_followers)
print("Ratio of mean comments on posts with both photos and videos to mean followers:",mean_comm_3/mean_followers)

Since the ratios are very small, the followers of these top 10 profiles may not all be genuine.\
Because these profiles are very famous, a large number of bots may follow them,\
and bots do not generate any interactions (likes or comments).\
Posts with only photos generate more interactions than video-only or mixed posts.\
This information can be useful for anyone whose priority is to collect more likes or comments in order to trend in Instagram searches.
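
Dividing a mean of likes by a mean of followers lets the largest accounts dominate the result. An alternative, sketched here on made-up tables, is to compute a per-post rate (likes divided by the author's followers) and average that by post type. The `like_rate` column and the `_demo` names are hypothetical.

```python
import pandas as pd

# Hypothetical profiles and posts.
profiles_demo = pd.DataFrame({"profile_id": [1, 2], "followers": [1000, 200]})
posts_demo = pd.DataFrame({
    "profile_id": [1, 1, 2, 2],
    "post_type": [1, 2, 1, 2],
    "numbr_likes": [100, 50, 40, 10],
})

# Attach each author's follower count to their posts.
joined = posts_demo.merge(profiles_demo, on="profile_id", how="inner")

# Likes per follower for each post, so that large accounts do not dominate the average.
joined["like_rate"] = joined["numbr_likes"] / joined["followers"]
rate_by_type = joined.groupby("post_type")["like_rate"].mean()
print(rate_by_type)
```

Profiles with zero followers would need to be excluded before the division, since they would otherwise produce infinite rates.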
