1. `agressivite`:
- Total hours: the total number of hours during which a user was active, computed from the timestamps of the first and last tweets.

- Tweet frequency (`f_tweets`): the number of tweets divided by the total number of active hours.

- Follow frequency (`f_friends`): the change in the number of friends divided by the total number of active hours, which roughly reflects how often the user follows other accounts.

- Aggressiveness score: the sum of the tweet frequency and the follow frequency (newly followed accounts) per hour, divided by the maximum number of actions possible per hour (350).

In [5]:
import json
from datetime import datetime

def calculate_aggressiveness(data, f_max = 350):
    user_scores = []

    for user_data in data:
        user_id = user_data['_id']
        tweets = user_data['tweets']

        if not tweets:
            user_scores.append({
                'user_id': user_id,
                'aggressiveness_score': 0
            })
            continue

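        # Sort chronologically so the first and last tweets bound the activity window.
        tweets = sorted(tweets, key=lambda t: int(t['timestamp_ms']))
        start_time = int(tweets[0]['timestamp_ms'])
        end_time = int(tweets[-1]['timestamp_ms'])
        # Fall back to 1 hour when all tweets share one timestamp, to avoid dividing by zero.
        total_hours = (end_time - start_time) / (1000 * 60 * 60) if end_time > start_time else 1

        f_tweets = len(tweets) / total_hours

        # Change in friends_count between the earliest and the latest tweet.
        initial_friends = tweets[0]['user']['friends_count']
        final_friends = tweets[-1]['user']['friends_count']
        f_friends = (final_friends - initial_friends) / total_hours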
        
        aggressiveness_score = (f_tweets + f_friends) / f_max
        
        user_scores.append({
            'user_id': user_id,
            'aggressiveness_score': aggressiveness_score
        })

    return user_scores
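
As a quick sanity check of the formula above, here is a minimal worked example on made-up numbers. The helper `aggressiveness` and the sample values are illustrative assumptions, not part of the original notebook.

```python
# Hypothetical helper: recomputes the score from already-aggregated values.
def aggressiveness(n_tweets, friends_delta, total_hours, f_max=350):
    f_tweets = n_tweets / total_hours        # tweets per hour
    f_friends = friends_delta / total_hours  # newly followed accounts per hour
    return (f_tweets + f_friends) / f_max

# 70 tweets and 35 newly followed accounts over 5 active hours:
# (14 + 7) / 350 = 0.06
score = aggressiveness(n_tweets=70, friends_delta=35, total_hours=5)
print(score)
```

A score close to 1 would mean the account acts at roughly the assumed hourly ceiling of 350 actions.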

2. Per-tweet averages:

-  `avg_retweet` (average retweets): the mean number of retweets over all of a user's tweets.

-  `avg_url` (average number of URLs): the mean number of URLs included in each of a user's tweets.

-  `avg_hashtag` (average number of hashtags): the mean number of hashtags included in each of a user's tweets.

In [6]:
import json

def calculate_averages(data):
    user_averages = []

    for user_data in data:
        user_id = user_data['_id']
        tweets = user_data['tweets']

        total_retweets = 0
        total_urls = 0
        total_hashtags = 0
        tweet_count = len(tweets)

        for tweet in tweets:
            total_retweets += tweet['retweet_count']
            total_urls += len(tweet['entities']['urls'])
            total_hashtags += len(tweet['entities']['hashtags'])

        if tweet_count > 0:
            avg_retweet = total_retweets / tweet_count
            avg_url = total_urls / tweet_count
            avg_hashtag = total_hashtags / tweet_count
        else:
            avg_retweet, avg_url, avg_hashtag = 0, 0, 0
        
        user_averages.append({
            'user_id': user_id,
            'avg_retweet': avg_retweet,
            'avg_url': avg_url,
            'avg_hashtag': avg_hashtag
        })

    return user_averages
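
The three averages can be checked by hand on a tiny made-up sample. This is only a sketch: `tweet_averages` and `sample_tweets` are hypothetical names, and the tweets carry just the fields the averages need.

```python
from statistics import fmean

# Made-up tweets; empty dicts stand in for URL and hashtag entities.
sample_tweets = [
    {'retweet_count': 4, 'entities': {'urls': [{}], 'hashtags': [{}, {}]}},
    {'retweet_count': 0, 'entities': {'urls': [], 'hashtags': [{}]}},
]

def tweet_averages(tweets):
    # Hypothetical compact variant; returns zeros for users without tweets.
    if not tweets:
        return {'avg_retweet': 0, 'avg_url': 0, 'avg_hashtag': 0}
    return {
        'avg_retweet': fmean(t['retweet_count'] for t in tweets),
        'avg_url': fmean(len(t['entities']['urls']) for t in tweets),
        'avg_hashtag': fmean(len(t['entities']['hashtags']) for t in tweets),
    }

result = tweet_averages(sample_tweets)
print(result)  # retweets: (4 + 0) / 2, URLs: (1 + 0) / 2, hashtags: (2 + 1) / 2
```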

3. `rationFollowersFriends`: The user's followers/friends ratio (0 when the user has no friends).

In [13]:
def calc_ration_followers_friends(data):
    user_ids = []
    user_ratios = []

    for user_tweets in data:
        user_id = user_tweets['_id']
        friends_count = user_tweets['tweets'][0]['user']['friends_count']
        followers_count = user_tweets['tweets'][0]['user']['followers_count']
        user_ids.append(user_id)
        ratio = followers_count / friends_count if friends_count != 0 else 0  # avoid division by zero

        user_ratios.append({
            'user_id': user_id,
            'ratio': ratio
        })
    return user_ratios

4. `mediumLength`: Average length of the user's tweets, in characters.

In [17]:
def calc_avg_tweet_length(data):
    user_ids = []
    avg_tweet_lengths = []

    for user_tweets in data:
        user_id = user_tweets['_id']
        total_tweet_length = 0
        num_tweets = len(user_tweets['tweets'])
        for tweet in user_tweets['tweets']:
            total_tweet_length += len(tweet['text'])
        avg_tweet_length = total_tweet_length / num_tweets if num_tweets > 0 else 0
        user_ids.append(user_id)
        avg_tweet_lengths.append({
            'user_id': user_id,
            'avg_tweet_length': avg_tweet_length
        })

    return avg_tweet_lengths

5. `rateOfRepliedTweets`: Percentage of tweets in a given list that received at least one reply.

In [24]:
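def calc_rate_replied_tweets(tweets):
    total_tweets = len(tweets)
    if total_tweets == 0:
        # No tweets: report 0 rather than dividing by zero.
        return 0
    # A tweet counts as replied when its reply_count is non-zero.
    total_replied = sum(1 for tweet in tweets if tweet['reply_count'])
    return total_replied * 100 / total_tweets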

def calc_rate_replied(data):
    replied_rate = []
    for user in data:
        tweets = user['tweets']
        res = calc_rate_replied_tweets(tweets)
        replied_rate.append({
            'user_id': user['_id'],
            'replied_rate': res
        })
    return replied_rate

6. `tweet_per_day`: Average number of tweets per day, taken over the days on which the user tweeted.

In [27]:
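from datetime import datetime

import numpy as np
import pandas as pd

def get_dates(t):
    # Parse the tweet's creation timestamp, e.g. '2020-01-31 13:45:00'.
    return datetime.strptime(t['created_date'], '%Y-%m-%d %H:%M:%S')

def calc_tweet_frequency(tweets):
    if not tweets:
        # The .dt accessor fails on an empty Series, so handle this case first.
        return 0
    dates = [get_dates(t) for t in tweets]
    serie_dates = pd.Series(dates)
    # Mean number of tweets over the days with at least one tweet.
    frequency = np.mean(serie_dates.dt.date.value_counts())
    return frequency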

def calc_tweet_frequency_per_day(data):
    frequency_list = []
    for user in data:
        tweets = user['tweets']
        frequency = calc_tweet_frequency(tweets)
        frequency_list.append({ 'user_id': user['_id'], 'frequency': frequency })
    return frequency_list
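
To make the averaging explicit, the sketch below recomputes the same quantity with only the standard library. The name `tweets_per_active_day` and the sample timestamps are assumptions for illustration. The mean is taken over days with at least one tweet, so silent days do not lower it.

```python
from collections import Counter
from datetime import datetime

def tweets_per_active_day(timestamps):
    # Count tweets per calendar day, then average over the days that appear.
    days = Counter(
        datetime.strptime(ts, '%Y-%m-%d %H:%M:%S').date() for ts in timestamps
    )
    return sum(days.values()) / len(days) if days else 0

sample = [
    '2024-01-01 08:00:00',
    '2024-01-01 21:30:00',
    '2024-01-01 23:59:59',
    '2024-01-03 12:00:00',
]
print(tweets_per_active_day(sample))  # 4 tweets over 2 active days -> 2.0
```

Dividing by the full span between the first and last tweet instead would give a different, lower figure for sporadic accounts; which one is wanted depends on how the feature is used downstream.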

7. `verified`: Whether the user (the author of a list of tweets) is a verified account.<br>
8. `visibility`: A score measuring how visible a user is on the platform. It counts how often the mention symbol (@) and the hashtag symbol (#) appear in the user's tweets, weights each per-tweet average by its average cost, and divides the sum by 140.

In [31]:
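# Return True or False depending on whether the account is verified.
def verified_check(user_infor):
    return user_infor['verified']

def visibility_compute(user_texts):
    # user_texts: list with the text of every tweet of one user.
    costMention = 11.4  # C(@)
    costHashtag = 11.6  # C(#)
    if not user_texts:
        return 0
    nbMention = 0
    nbHashtag = 0
    for text in user_texts:
        nbMention += text.count("@")
        nbHashtag += text.count("#")

    avgMention = nbMention / len(user_texts) * costMention
    avgHashtag = nbHashtag / len(user_texts) * costHashtag

    # Normalize by 140 (presumably the classic tweet length limit).
    visibility = (avgMention + avgHashtag) / 140
    return visibility

def visibility_et_verified_check(data):
    # Built locally so that repeated calls do not accumulate duplicate rows.
    verific_et_visibil_list = []

    # Process every user in the JSON data
    for user_data in data:
        user_tweets = user_data['tweets']
        # Read the verification flag from the user info
        user_infor = user_tweets[0]['user']
        verification = verified_check(user_infor)
        # Compute visibility over all of the user's tweets. Passing a single
        # string would make the loop iterate over characters instead of tweets.
        user_texts = [tweet['text'] for tweet in user_tweets]
        visibility = visibility_compute(user_texts)

        # Append the result for this user
        verific_et_visibil_list.append({
            'user_id': user_data['_id'],
            'verification': verification,
            'visibility': visibility
        })

    return verific_et_visibil_list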
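
The visibility formula can be checked by hand on a couple of made-up tweets. This is a minimal sketch: the function name `visibility` and the sample texts are assumptions, while the costs 11.4 and 11.6 and the divisor 140 are the constants used in the cell above.

```python
def visibility(texts, cost_mention=11.4, cost_hashtag=11.6, max_len=140):
    # texts must be a list of tweet strings; a bare string would be iterated
    # character by character and give a meaningless average.
    if not texts:
        return 0.0
    mentions = sum(t.count('@') for t in texts)
    hashtags = sum(t.count('#') for t in texts)
    avg_mention = mentions / len(texts) * cost_mention
    avg_hashtag = hashtags / len(texts) * cost_hashtag
    return (avg_mention + avg_hashtag) / max_len

texts = ['@alice look at this #python', 'no symbols here']
# One mention and one hashtag over two tweets:
# (0.5 * 11.4 + 0.5 * 11.6) / 140 = 11.5 / 140
print(visibility(texts))
```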

9. `accountAge` (days): Age of the user's account.<br>
10. `avg_fav`: Average number of favorites per status (`favourites_count` divided by `statuses_count`).

In [33]:
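from datetime import datetime, timezone

# Compute the account age in days.
def get_account_age(creation_date):
    # %z keeps the UTC offset, so the subtraction below compares two
    # timezone-aware datetimes instead of mixing UTC with local time.
    creation_date_formatted = datetime.strptime(creation_date, '%a %b %d %H:%M:%S %z %Y')
    today_date = datetime.now(timezone.utc)
    account_age = (today_date - creation_date_formatted).days
    return account_age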


# Compute the average number of favorites per status.
def get_avg_fav(favorites_count, statuses_count):
    if statuses_count > 0:
        return favorites_count / statuses_count
    else:
        return 0

def account_age_fav_ratio(data):
    account_ages_favs = []

    for user_data in data:
        user_tweets = user_data['tweets']
        creation_date = user_tweets[0]['user']['created_at']
        favorites_count = user_tweets[0]['user']['favourites_count']
        statuses_count = user_tweets[0]['user']['statuses_count']

        # Compute the account age and the favorites-per-status ratio
        account_age = get_account_age(creation_date)
        avg_fav = get_avg_fav(favorites_count, statuses_count)

        account_ages_favs.append({
            'user_id': user_data['_id'],
            'account_age': account_age,
            'avg_fav': avg_fav
        })
    
    return account_ages_favs
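
Because the account age depends on the current date, it is awkward to verify. The sketch below is a hypothetical variant, `account_age_days`, that accepts an injectable reference time so the result is reproducible; the sample date string follows the `created_at` format parsed above, and the `now` parameter is an assumption added for illustration.

```python
from datetime import datetime, timezone

def account_age_days(created_at, now=None):
    # Parse the created_at string, keeping its UTC offset via %z.
    created = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    # Default to the current UTC time when no reference is supplied.
    now = now or datetime.now(timezone.utc)
    return (now - created).days

reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
age = account_age_days('Sun Jan 01 00:00:00 +0000 2023', now=reference)
print(age)  # 2023 is not a leap year, so exactly 365 days
```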



11. `group_popularity`: The user's popularity in groups (lists): the user's `listed_count` divided by the largest `listed_count` among all users.

In [36]:
def calc_group_popularity(data):
    total_groups = 0
    temp_list = []  # temporarily holds each user's (user_id, listed_count)

    # Collect every user's listed_count and track the maximum
    for user in data:
        listed_count = user['tweets'][0]['user']['listed_count']
        user_id = user['_id']
        total_groups = max(total_groups, listed_count)
        temp_list.append((user_id, listed_count))
    
    # Final result list
    group_popularity_list = []
    
    # Compute each user's group popularity
    for user_id, listed_count in temp_list:
        if total_groups > 0:
            group_popularity = listed_count / total_groups
        else:
            group_popularity = 0
        # Append the result as a dict
        group_popularity_list.append({
            'user_id': user_id,
            'group_popularity': group_popularity,
        })
    
    return group_popularity_list
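
The normalization is a plain division by the maximum, which a small example makes concrete. `normalize_by_max` and the sample counts are hypothetical and only illustrate the idea.

```python
def normalize_by_max(listed_counts):
    # listed_counts: hypothetical mapping user_id -> listed_count.
    top = max(listed_counts.values(), default=0)
    return {uid: (n / top if top > 0 else 0) for uid, n in listed_counts.items()}

popularity = normalize_by_max({'u1': 10, 'u2': 40, 'u3': 0})
print(popularity)  # {'u1': 0.25, 'u2': 1.0, 'u3': 0.0}
```

The score is therefore relative to the dataset: adding one heavily listed user lowers everyone else's value.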


12. `mentions_freq`: Average number of other users mentioned per tweet.

In [38]:
def calc_mentions_freq(data):
    mentions_freq_list = []
    for user in data:
        tweets = user['tweets']
        total_mentions = sum(len(tweet['entities']['user_mentions']) for tweet in tweets)
        total_tweets = len(tweets)
        mentions_freq = total_mentions / total_tweets if total_tweets > 0 else 0
        mentions_freq_list.append({
            'user_id': user['_id'],
            'mentions_freq': mentions_freq
        })
    return mentions_freq_list

Compute all the features and write them to a CSV file.

In [39]:
import json
import csv

# Load the data
with open('output.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

aggressiveness_scores = calculate_aggressiveness(data)
averages_scores = calculate_averages(data)
ratio_followers_friend = calc_ration_followers_friends(data)
avg_tweet_length = calc_avg_tweet_length(data)
rate_replied = calc_rate_replied(data)
tweet_frequency = calc_tweet_frequency_per_day(data)
visibility_et_verified = visibility_et_verified_check(data)
age_fav_ratio = account_age_fav_ratio(data)
group_popularity = calc_group_popularity(data)
mentions_freq = calc_mentions_freq(data)

combined_list = [{**a, **b, **c, **d, **e, **f, **g, **h, **i, **j} for a, b, c, d, e, f, g, h, i, j in zip(aggressiveness_scores, averages_scores, ratio_followers_friend, avg_tweet_length, rate_replied, tweet_frequency, visibility_et_verified, age_fav_ratio, group_popularity, mentions_freq)]

# Open a new CSV file for writing
with open('user_twitter_data.csv', 'w', newline='', encoding='utf-8') as file:
    fieldnames = ['user_id', 'aggressiveness_score', 'avg_retweet', 'avg_url', 'avg_hashtag', 'ratio_followers_friend', 'avg_tweet_length', 'repliedrate_replied', 'tweet_frequency', 'verified', 'visibility', 'account_age', 'avg_fav', 'group_popularity', 'mentions_freq']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # write the header row

    for user in combined_list:
        row = {
            'user_id': user['user_id'],
            'aggressiveness_score': user['aggressiveness_score'],
            'avg_retweet': user['avg_retweet'],
            'avg_url': user['avg_url'],
            'avg_hashtag': user['avg_hashtag'],
            'ratio_followers_friend': user['ratio'],
            'avg_tweet_length': user['avg_tweet_length'],
            'repliedrate_replied': user['replied_rate'],
            'tweet_frequency': user['frequency'],
            'verified': user['verification'],
            'visibility': user['visibility'],
            'account_age': user['account_age'],
            'avg_fav': user['avg_fav'],
            'group_popularity': user['group_popularity'],
            'mentions_freq': user['mentions_freq']
        }
        writer.writerow(row)  # write one row per user

print("Data has been successfully written to 'user_twitter_data.csv'.")



Data has been successfully written to 'user_twitter_data.csv'.
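
The `zip`-based merge above relies on every function returning users in the same order. As a hedged alternative, the sketch below merges feature lists by `user_id` instead; `merge_by_user_id` and the sample lists are assumptions for illustration, not part of the original pipeline.

```python
from collections import defaultdict

def merge_by_user_id(*feature_lists):
    # Each feature list is a list of dicts that all carry a 'user_id' key.
    merged = defaultdict(dict)
    for features in feature_lists:
        for row in features:
            merged[row['user_id']].update(row)
    # Users keep the order in which they were first seen.
    return list(merged.values())

scores = [{'user_id': 1, 'aggressiveness_score': 0.06},
          {'user_id': 2, 'aggressiveness_score': 0.0}]
ratios = [{'user_id': 2, 'ratio': 1.5},   # different order on purpose
          {'user_id': 1, 'ratio': 0.2}]
combined = merge_by_user_id(scores, ratios)
print(combined)
```

A user missing from one feature list would simply lack that key in the merged row, so the CSV-writing step would need a default for it.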
