### Inspiration

A large number of datasets are uploaded to Kaggle daily. However, not all of them get noticed and upvoted.

The first goal of this notebook is to find out which attributes might bring additional upvotes for a dataset on Kaggle. The second goal is to help Kagglers improve the overall quality of the datasets they upload.

#### Notebook contents:

* [1. General statistical overview of the Kaggle datasets](#section-one)
* [2. Correlation between the number of views/downloads/kernels and the upvotes](#section-two)
* [3. Dataset usability rating and the upvotes](#section-three)
    - [3.1 Tags](#subsection-one)
    - [3.2 License](#subsection-two)
    - [3.3 Description & subtitle](#subsection-three)
* [4. Number of dataset versions and the upvotes](#section-four)
* [5. Dataset age and the upvotes](#section-five)
* [6. Dataset author performance tier and the upvotes](#section-six)
* [7. Hyped topics and the upvotes](#section-seven)
* [Conclusion](#section-eight)

In [None]:
#general data work
import numpy as np
import pandas as pd
import datetime as dt
from datetime import date, timedelta

#visuals
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py

In [None]:
# Load the data 

datasets = pd.read_csv('/kaggle/input/meta-kaggle/Datasets.csv', low_memory=False)
dataset_versions = pd.read_csv('/kaggle/input/meta-kaggle/DatasetVersions.csv')
dataset_tags = pd.read_csv('/kaggle/input/meta-kaggle/DatasetTags.csv')
users = pd.read_csv('/kaggle/input/meta-kaggle/Users.csv')
tags = pd.read_csv('/kaggle/input/meta-kaggle/Tags.csv')

<a id="section-one"></a>
### 1. General statistical overview of the Kaggle datasets

*Important note:* all datasets in Meta Kaggle are public. Private datasets on Kaggle are not included, which makes sense.

The very first dataset was uploaded to Kaggle in 2015. Since then, the number of datasets has grown dramatically year by year:


In [None]:
# work with time formats
datasets['CreationDate'] = pd.to_datetime(datasets['CreationDate'] )

# extract year
datasets['Year_uploaded'] = datasets['CreationDate'].dt.year

In [None]:
#plotting a number of datasets uploaded year by year
upload_years = datasets.groupby('Year_uploaded')['Id'].count().sort_values(ascending = False).head(6).reset_index()

x = upload_years['Year_uploaded']
y = upload_years['Id']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])
fig.update_layout(title_text='Kaggle Datasets Upload (year-by-year)', title_x=0.5)
fig.update_yaxes(title='Number of Datasets')

fig.show()

Dataset uploads rose significantly year by year (especially in 2020). This is expected, as machine learning related disciplines have become more popular recently and the number of Kaggle activities has grown as a result.
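The bar chart above orders the years by upload count and keeps the top six, which only reads chronologically if the counts rise every year. A minimal sketch of a strictly chronological count is shown below; the `creation` series is made-up toy data standing in for `datasets['CreationDate']`.

```python
import pandas as pd

# Toy stand-in for datasets['CreationDate'] (hypothetical dates)
creation = pd.to_datetime(pd.Series(
    ["2018-03-01", "2019-05-10", "2019-07-21",
     "2020-01-15", "2020-02-02", "2020-09-30"]
))

# Count uploads per year, then order by year rather than by count
per_year = creation.dt.year.value_counts().sort_index()
print(per_year)
```

Sorting by index keeps the x-axis in calendar order even if a later year happens to have fewer uploads than an earlier one.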

Let's see other statistics:

In [None]:
#Total number of datasets 
total_datasets = datasets["Id"].count()
print('Total number of datasets:',total_datasets)

#number of upvoted datasets (1+ upvotes)
upvoted_datasets = len(datasets[datasets['TotalVotes'] > 0])
print('The number of datasets with 1+ upvotes:', upvoted_datasets)

#number of highly upvoted datasets (more than 10 upvotes)
upvoted_datasets_10 = len(datasets[datasets['TotalVotes'] > 10])
print('The number of datasets with more than 10 upvotes:', upvoted_datasets_10)

#number of datasets with 1 to 10 upvotes
one_to_ten_upvotes = upvoted_datasets - upvoted_datasets_10
print('The number of datasets with 1 to 10 upvotes:', one_to_ten_upvotes)

#number of datasets with no upvotes
total_no_upvotes = total_datasets - upvoted_datasets
print('The number of datasets with 0 upvotes:', total_no_upvotes)

In [None]:
#making a donut chart for the dataset upvotes
labels = ['Datasets with 1 to 10 upvotes', 'Datasets with more than 10 upvotes', 'Datasets with no upvotes']
values = [one_to_ten_upvotes, upvoted_datasets_10, total_no_upvotes]

fig = go.Figure(data=[go.Pie(labels=labels, values=values, title = "UPVOTES", hole=.6)])
fig.update_layout(title_text="Overview of the datasets' upvotes")
fig.show()

As seen, 59% of all datasets uploaded to Kaggle have no votes at all, and only 8% of datasets have more than 10 votes.
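The shares quoted above can also be computed directly rather than read off the chart. The sketch below is one way to do it with `pd.cut`; the `votes` series is made-up toy data standing in for `datasets['TotalVotes']`, and the bucket labels are my own naming.

```python
import pandas as pd

# Toy stand-in for datasets['TotalVotes'] (hypothetical values)
votes = pd.Series([0, 0, 0, 0, 0, 0, 1, 3, 10, 25])

# Bucket edges: (-1, 0] = no upvotes, (0, 10] = 1 to 10, (10, inf) = more than 10
buckets = pd.cut(
    votes,
    bins=[-1, 0, 10, float("inf")],
    labels=["no upvotes", "1 to 10 upvotes", "more than 10 upvotes"],
)

# Share of each bucket in percent
shares = (buckets.value_counts(normalize=True) * 100).round(1)
print(shares)
```

Because the right edge of each bin is inclusive, a dataset with exactly 10 upvotes falls into the "1 to 10" bucket, which matches the `> 10` filter used in the cell above.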

Generally, the better the quality of a dataset, the more upvotes it has. Several factors potentially influence the number of upvotes a dataset receives, and I am going to look closer at them.

<a id="section-two"></a>
### 2. Correlation between the number of views/downloads/kernels and the upvotes

In [None]:
#upvotes & views correlation on a scatter plot
fig = px.scatter(datasets, x="TotalViews", y="TotalVotes", trendline="ols")
fig.update_layout(title_text='Dataset Views and Upvotes', title_x=0.5)
fig.show()

In [None]:
#upvotes & downloads correlation on a scatter plot
fig = px.scatter(datasets, x="TotalDownloads", y="TotalVotes", trendline="ols")
fig.update_layout(title_text='Dataset Downloads and Upvotes', title_x=0.5)
fig.show()

In [None]:
#upvotes & kernels correlation on a scatter plot
fig = px.scatter(datasets, x="TotalKernels", y="TotalVotes", trendline="ols")
fig.update_layout(title_text='Dataset Kernels and Upvotes', title_x=0.5)
fig.show()

There is a roughly linear relationship between the number of upvotes and the number of downloads/views/kernels related to a dataset. However, that might mean at least two things:

- on the one hand, if the dataset author creates their own kernels and works to get as many views on their dataset as possible, that might bring some additional upvotes
- on the other hand, this can be a "[reverse causality](https://www.statisticshowto.com/reverse-causality/#:~:text=Reverse%20causality%20means%20that%20X,exposure%20causes%20the%20risk%20factor.)" issue: better quality datasets with more upvotes get more views, are downloaded more frequently, and more Kagglers make kernels on them as a result

So I will try to find some other dataset quality variables.
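The scatter plots show the relationship visually; a coefficient puts a number on it. Since vote and view counts tend to be dominated by a few very popular datasets, the sketch below computes both the Pearson coefficient and a rank-based (Spearman) one, obtained here as the Pearson correlation of the ranks. The values are made-up toy data, not Meta Kaggle figures.

```python
import pandas as pd

# Toy stand-in for the Meta Kaggle columns (hypothetical values)
df = pd.DataFrame({
    "TotalViews": [10, 50, 200, 1000, 50000],
    "TotalVotes": [0, 1, 2, 9, 40],
})

# Pearson: strength of the linear relation, sensitive to extreme values
pearson = df["TotalViews"].corr(df["TotalVotes"])

# Spearman: Pearson correlation computed on ranks, i.e. monotonic relation
spearman = df["TotalViews"].rank().corr(df["TotalVotes"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Neither coefficient says anything about the direction of causality discussed above; they only describe how closely the two columns move together.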

<a id="section-three"></a>
### 3. Dataset usability rating and the upvotes

The usability rating is assigned to a dataset based on certain criteria and has a range from 1 (lowest) to 10 (highest). The usability rating was recently introduced on Kaggle and, unfortunately, the Meta Kaggle dataset does not contain any values for it.

However, the rating depends on the information categories that are filled out by the dataset author: the more categories are filled out, the higher the rating.

The main categories are *Tags, License, Description, Subtitle*.
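Since the actual rating is not available, a rough stand-in is to count how many of the four categories are filled in. The sketch below is only such a proxy and is not Kaggle's usability formula; the `meta` table, its column names, and the `filled_fields` column are hypothetical.

```python
import pandas as pd

# Toy metadata table (hypothetical values); None marks a field left blank
meta = pd.DataFrame({
    "Tags": ["nlp", None, "health"],
    "License": ["CC0", None, None],
    "Description": ["Some text", None, "Some text"],
    "Subtitle": ["Short summary", None, None],
})

# Count how many of the four fields are filled in for each dataset (0 to 4)
fields = ["Tags", "License", "Description", "Subtitle"]
meta["filled_fields"] = meta[fields].notna().sum(axis=1)
print(meta["filled_fields"].tolist())
```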

<a id="subsection-one"></a>
####    3.1 Tags 

Is there a correlation between the existence of tags on a dataset and its upvotes?

In [None]:
#data prep & merging
dataset_tags['Number_of_tags_used'] = dataset_tags.groupby(['DatasetId'])['Id'].transform('count')
datasets.insert(2, 'Number_of_tags_used', datasets['Id'].map(dataset_tags.drop_duplicates('DatasetId').set_index('DatasetId')['Number_of_tags_used']))
#converting all NaN values into zeros
datasets['Number_of_tags_used'] = datasets['Number_of_tags_used'].fillna(0)
#converting the tag counts into text labels
datasets['Number_of_tags_used'] = datasets['Number_of_tags_used'].replace({0: '0 tags', 1: '1 tag', 2: '2 tags', 3: '3 tags', 4: '4 tags', 5: '5 tags', 6: '6 tags', 7: '7 tags', 8: '8 tags', 9: '9 tags', 10: '10 tags', 11: '11 tags', 12: '12 tags', 13: '13 tags'})

In [None]:
#bar plotting the number of tags per dataset across all Kaggle datasets
#(named tag_counts so the tags table loaded earlier is not overwritten)
tag_counts = datasets.groupby('Number_of_tags_used')['Id'].count().sort_values(ascending = False).reset_index()

x = tag_counts['Number_of_tags_used']
y = tag_counts['Id']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])
fig.update_layout(title_text='Number of tags used across all datasets on Kaggle (total)', title_x=0.5, width=880)
fig.update_yaxes(title='Number of Datasets')
fig.show()

As seen, most of the datasets on Kaggle don't have any tags (34k). Also, only a few datasets have more than 8 tags. Let's now see how the number of dataset tags relates to the average number of upvotes.

In [None]:
#bar plotting the number of tags in datasets by the average upvotes achieved
tags_votes = datasets.groupby('Number_of_tags_used')['TotalVotes'].mean().sort_values(ascending = False).reset_index().round(1)

x = tags_votes['Number_of_tags_used']
y = tags_votes['TotalVotes']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])

fig.update_traces(marker_color='rgb(46, 137, 162)', opacity=1)

fig.update_layout(title_text='Number of tags used on a dataset and the average Upvotes', title_x=0.5, width=880)
fig.update_yaxes(title='Average Upvotes')
fig.show()

It seems there is a correlation between the presence of tags on a dataset and the average upvotes the dataset obtains. Datasets with 0 tags have the lowest average upvotes, even though they are by far the largest group on Kaggle.

The number of tags used is also somewhat correlated with the average upvotes, with the highest mean at 13 tags.

So the obvious conclusion here is to put at least a couple of tags on a dataset in order to attract views and probably get some upvotes.
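A mean can be pulled up by a single very popular dataset, so it may help to look at the group size and the median next to it. The sketch below shows one way to get all three with a single `groupby`; the numbers are made-up toy data, not Meta Kaggle values.

```python
import pandas as pd

# Toy data (hypothetical): one viral dataset inflates the mean of its group
df = pd.DataFrame({
    "Number_of_tags_used": ["0 tags", "0 tags", "0 tags",
                            "2 tags", "2 tags", "2 tags"],
    "TotalVotes": [0, 0, 3, 1, 2, 300],
})

# Group size, mean and median upvotes per tag count
summary = (
    df.groupby("Number_of_tags_used")["TotalVotes"]
      .agg(["count", "mean", "median"])
)
print(summary)
```

In this toy example the "2 tags" group has a mean far above its median, which is the pattern to watch for in small groups such as the datasets with more than 8 tags.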

<a id="subsection-two"></a>
#### 3.2 License

Is there a correlation between having a license specified on a dataset and its upvotes?

In [None]:
#data prep & merging
datasets.insert(2, 'LicenseName', datasets['Id'].map(dataset_versions.drop_duplicates('DatasetId').set_index('DatasetId')['LicenseName']))

In [None]:
#horizontal bar plotting the dataset license types by popularity on Kaggle
licenses = datasets.groupby('LicenseName')['Id'].count().sort_values(ascending = True).reset_index()

x = licenses['Id']
y = licenses['LicenseName']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=x,orientation='h',textposition='auto')])
fig.update_layout(title_text='Popular license types across all Kaggle datasets', title_x=0.5, width = 1100, height = 1500)
fig.update_xaxes(title='Number of datasets')
fig.update_yaxes(title='')

fig.show()

"Unknown" means the author hasn't specified a license for the dataset and left it blank. As seen, most of the datasets on Kaggle do not have a license specified, and this is expected. Let's now see which license types have the most upvotes on average.

In [None]:
#horizontal bar plotting the dataset licenses by the average upvotes
licenses_votes = datasets.groupby('LicenseName')['TotalVotes'].mean().sort_values(ascending = True).reset_index().round(1)

x = licenses_votes['TotalVotes']
y = licenses_votes['LicenseName']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=x,orientation='h',textposition='auto')])

fig.update_traces(marker_color='rgb(46, 137, 162)', opacity=1)

fig.update_layout(title_text='License types that have the most average upvotes', title_x=0.5, width = 1200, height = 1500)
fig.update_xaxes(title='Average Upvotes')
fig.update_yaxes(title='')

fig.show()

Interestingly, "Reddit API" datasets have the highest average upvotes, although there are only 40+ datasets with that license. The "Unknown" (no license) datasets get the lowest average upvotes, despite the fact that there are many more datasets with the "Unknown" license on Kaggle.

It seems that specifying a license is related to the average upvotes, and it is better to at least pick some license type when uploading a dataset rather than leaving it blank. This also looks fair enough, because better quality datasets are probably made by more professional and responsible Kagglers, who usually specify the license.
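With only a few dozen datasets behind some license types, their averages are easy to skew. One option is to drop small groups before ranking by mean, as in the sketch below; the license names, vote counts, and the `MIN_GROUP_SIZE` threshold are all hypothetical choices for illustration.

```python
import pandas as pd

# Toy data (hypothetical license names and vote counts)
df = pd.DataFrame({
    "LicenseName": ["Unknown"] * 4 + ["CC0"] * 3 + ["Rare license"],
    "TotalVotes": [0, 1, 0, 3, 5, 7, 6, 90],
})

MIN_GROUP_SIZE = 3  # arbitrary threshold chosen for this toy example

stats = df.groupby("LicenseName")["TotalVotes"].agg(["count", "mean"])

# Keep only licenses with enough datasets for the mean to be meaningful
reliable = stats[stats["count"] >= MIN_GROUP_SIZE].sort_values("mean", ascending=False)
print(reliable)
```

On the real data a sensible threshold would have to be picked by looking at the distribution of group sizes first.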

<a id="subsection-three"></a>
#### 3.3 Description & subtitle

Is there a correlation between the existence of the dataset description & subtitle and its upvotes?

In order to check that, I first insert the subtitles & descriptions into the datasets dataframe and then convert the values as follows:
- if there is a text element in the subtitle/description col, that would be "subtitle/description exists"
- if there is a null value in the subtitle/description col, that would be "no subtitle/description"

In [None]:
#data prep & merging
datasets.insert(2, 'Subtitle', datasets['Id'].map(dataset_versions.drop_duplicates('DatasetId').set_index('DatasetId')['Subtitle']))
datasets.insert(2, 'Description', datasets['Id'].map(dataset_versions.drop_duplicates('DatasetId').set_index('DatasetId')['Description']))

#rename all elements in "Subtitle" col into 2 categories
datasets.loc[datasets['Subtitle'].notnull(), 'Subtitle'] = 'Subtitle Exists'
datasets.loc[datasets['Subtitle'].isnull(), 'Subtitle'] = 'No Subtitle'

#rename all elements in "Description" col into 2 categories
datasets.loc[datasets['Description'].notnull(), 'Description'] = 'Description Exists'
datasets.loc[datasets['Description'].isnull(), 'Description'] = 'No Description'


In [None]:
#plotting the number of datasets with Subtitle
subtitles = datasets.groupby('Subtitle')['Id'].count().sort_values(ascending = False).reset_index().round(1)
x = subtitles['Subtitle']
y = subtitles['Id']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])

fig.update_traces(marker_color='rgb(151, 174, 186)', marker_line_color='rgb(8,48,107)', marker_line_width=1.5, opacity=1)

fig.update_layout(title_text='Subtitles across all Kaggle Datasets', title_x=0.5)
fig.update_xaxes(title='')
fig.update_yaxes(title='Number of Datasets')

fig.show()

In [None]:
#plotting a number of datasets with Description
descriptions = datasets.groupby('Description')['Id'].count().sort_values(ascending = False).reset_index().round(1)
x = descriptions['Description']
y = descriptions['Id']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])

fig.update_traces(marker_color='rgb(151, 174, 186)', marker_line_color='rgb(8,48,107)', marker_line_width=1.5, opacity=1)

fig.update_layout(title_text='Descriptions across all Kaggle Datasets', title_x=0.5)

fig.update_xaxes(title='')
fig.update_yaxes(title='Number of Datasets')

fig.show()

It is clear that most of the datasets on Kaggle don't have descriptions and subtitles. Let's now see how the presence of a description/subtitle relates to the number of upvotes.

In [None]:
#box plotting the existence of a subtitle on a dataset and its upvotes
fig = px.box(datasets, x="Subtitle", y="TotalVotes")
fig.update_layout(title_text='Subtitle existence in a dataset and its Upvotes (total)', title_x=0.5)
fig.show()

In [None]:
#box plotting the existence of a description on a dataset and its upvotes
fig = px.box(datasets, x="Description", y="TotalVotes")
fig.update_layout(title_text='Description existence in a dataset and its Upvotes (total)', title_x=0.5)
fig.show()

As seen, datasets with descriptions and subtitles achieve more upvotes in total, even though the majority of the datasets on Kaggle have neither a description nor a subtitle. This seems reasonable: from the usability and quality standpoints, it is good to add a subtitle and a description to a dataset, so other Kagglers can see more details about it in a convenient way.

The obvious conclusion here is to add a description & a subtitle when uploading a dataset to Kaggle in order to make it decent and get more upvotes.

In general, it seems that all dataset usability categories are related to the upvotes, and thus it is better to take them seriously and document the dataset well if upvotes are needed.
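The two box plots look at subtitle and description separately. To see how they combine, one could tabulate the upvotes for each of the four combinations, as in the sketch below. It reuses the labels from the cells above, but the rows are made-up toy data and the choice of the median is mine.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); None marks a missing subtitle or description
df = pd.DataFrame({
    "Subtitle": ["a", None, "b", None, "c", None],
    "Description": ["x", "y", None, None, "z", None],
    "TotalVotes": [12, 4, 3, 0, 8, 1],
})

# Same two-category labels as in the notebook cells above
df["Subtitle"] = np.where(df["Subtitle"].notna(), "Subtitle Exists", "No Subtitle")
df["Description"] = np.where(df["Description"].notna(), "Description Exists", "No Description")

# Median upvotes for each combination of the two categories
combo = df.pivot_table(
    index="Description",
    columns="Subtitle",
    values="TotalVotes",
    aggfunc="median",
)
print(combo)
```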

<a id="section-four"></a>
### 4. Number of dataset versions and the upvotes

The assumption: datasets that are frequently updated (have more versions) achieve more upvotes. To check it, I will take the maximum version number of each dataset from the dataset_versions table and see if this is the case.

In [None]:
#data prep & merging
versions_num = dataset_versions.groupby('DatasetId')['VersionNumber'].max().reset_index()
datasets.insert(2, 'Number_of_dataset_versions', datasets['Id'].map(versions_num.drop_duplicates('DatasetId').set_index('DatasetId')['VersionNumber']))

#converting all NaN values into one
datasets['Number_of_dataset_versions'] = datasets['Number_of_dataset_versions'].fillna(1)

In [None]:
#number of the dataset versions & upvotes correlation on a scatter plot

fig = px.scatter(datasets, x='Number_of_dataset_versions', y="TotalVotes", trendline="ols")
fig.update_layout(title_text='Number of dataset versions and Upvotes', title_x=0.5)
fig.show()

In [None]:
#correlation coeff between number of versions and upvotes
data = [
    go.Heatmap(
        z= datasets[['TotalVotes', 'Number_of_dataset_versions']].corr().values,
        x=datasets[['TotalVotes', 'Number_of_dataset_versions']].columns.values,
        y=datasets[['TotalVotes', 'Number_of_dataset_versions']].columns.values,
        colorscale='viridis',
        reversescale = False,
        opacity = 1.0 )
]

layout = go.Layout(
    title='Correlation: number of dataset versions and upvotes',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 800, height = 600)

fig = go.Figure(data=data, layout=layout)
fig.show()


It seems that there is no correlation between the number of dataset versions and the upvotes. That looks fair, as datasets aren't like kernel code, which always has some room for improvement. Once a good dataset is created, it can stay untouched for a while, collecting upvotes over time as more and more Kagglers use it for their kernels.
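For just two variables, the coefficient itself can be read straight from the correlation matrix instead of from a heatmap colour. A minimal sketch follows; the values are made-up toy data constructed so that the two columns are uncorrelated.

```python
import pandas as pd

# Toy data (hypothetical), built so that votes do not trend with versions
df = pd.DataFrame({
    "Number_of_dataset_versions": [1, 2, 3, 4, 5],
    "TotalVotes": [5, 1, 3, 1, 5],
})

# 2x2 Pearson correlation matrix; the off-diagonal entry is the coefficient
corr_matrix = df.corr()
r = corr_matrix.loc["TotalVotes", "Number_of_dataset_versions"]
print(round(r, 3))
```

A value close to 0 indicates no linear relationship, while values near 1 or -1 indicate a strong one.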

<a id="section-five"></a>
### 5. Dataset age and the upvotes

Is there a correlation between how long the dataset stays on Kaggle and its upvotes?

In [None]:
#create a new col for the days age since dataset creation
datasets['Today'] = dt.datetime.now().date()
datasets['Today'] = pd.to_datetime(datasets['Today'])
datasets['Dataset_age_days'] = (datasets['Today'] - datasets['CreationDate']).dt.days

#dropping today col
datasets = datasets.drop(['Today'], axis=1)

In [None]:
#number of days dataset stays on Kaggle & its upvotes correlation on a scatter plot

fig = px.scatter(datasets, x='Dataset_age_days', y='TotalVotes', trendline="ols")
fig.update_layout(title_text='Dataset age (days) and Upvotes', title_x=0.5)
fig.show()

The two don't seem strongly correlated, but if I look at the average dataset upvotes by upload year, the picture is a bit different:

In [None]:
#bar plotting the average upvotes and the dataset creation by years
years = datasets.groupby('Year_uploaded')['TotalVotes'].mean().reset_index().round(2)

x = years['Year_uploaded']
y = years['TotalVotes']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])

fig.update_traces(marker_color='rgb(46, 137, 162)', opacity=1)

fig.update_layout(title_text='Datasets uploaded by years and their average Upvotes', title_x=0.5)
fig.update_xaxes(title='Year uploaded')
fig.update_yaxes(title='Average Upvotes')

fig.show()

At the beginning of this notebook, it was shown that more and more datasets are uploaded to Kaggle every year. The last figure shows that older datasets have more upvotes on average.

At first glance, it looks like the longer a dataset stays on Kaggle, the more upvotes it gets on average. On the other hand, that might not be the whole story: many users upload datasets of any quality hoping for upvotes, and the number of uploads is increasing dramatically as a result (especially in 2020). Quantity isn't quality, though, and older datasets were probably of better quality than most of the new ones. That might be why older datasets get more upvotes on average: fewer datasets of better quality vs. many datasets of worse quality.

My guess would be: if a dataset is of a good usability standard and quality, it still has a high chance of obtaining more upvotes over time.
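One way to separate "older" from "better" is to normalise the upvotes by age, for example as upvotes per day on Kaggle. The sketch below illustrates the idea on made-up toy data; the `votes_per_day` column and the fixed `reference_date` are my own assumptions (the notebook itself uses the current date).

```python
import pandas as pd

# Toy data (hypothetical creation dates and vote counts)
df = pd.DataFrame({
    "CreationDate": pd.to_datetime(["2016-01-01", "2019-01-01", "2020-06-01"]),
    "TotalVotes": [200, 100, 50],
})

# Fixed reference date so the example is reproducible
reference_date = pd.Timestamp("2021-01-01")

age_days = (reference_date - df["CreationDate"]).dt.days

# Guard against division by zero for datasets created on the reference date
df["votes_per_day"] = df["TotalVotes"] / age_days.clip(lower=1)
print(df.sort_values("votes_per_day", ascending=False))
```

In this toy example the oldest dataset has the most upvotes in total but the lowest rate per day, which is the kind of effect a raw average by upload year cannot show.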

<a id="section-six"></a>
### 6. Dataset author performance tier and the upvotes

Does the performance tier of the dataset author affect the number of upvotes?

There are 6 tiers on Kaggle, and they are represented in the users table as follows:

0 - Novice

1 - Contributor

2 - Expert

3 - Master

4 - Grandmaster

5 - Kaggle Team

Let's see how they are distributed across the Kaggle datasets:

In [None]:
#data prep & merging
datasets.insert(2, 'Author_tier', datasets['CreatorUserId'].map(users.drop_duplicates('Id').set_index('Id')['PerformanceTier']))
#replacing missing tier values with 0 (Novice)
datasets['Author_tier'] = datasets['Author_tier'].fillna(0)
#converting tier codes into tier names
datasets['Author_tier'] = datasets['Author_tier'].replace({0: 'Novice', 1: 'Contributor', 2: 'Expert', 3: 'Master', 4: 'Grandmaster', 5: 'Kaggle Team'})

In [None]:
#bar plotting the distribution of users' performance tiers across all Kaggle datasets
user_tiers = datasets.groupby('Author_tier')['Id'].count().sort_values(ascending = False).reset_index().round(1)

x = user_tiers['Author_tier']
y = user_tiers['Id']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])
fig.update_layout(title_text='Author performance tiers across all Kaggle datasets (totals)', title_x=0.5)
fig.update_yaxes(title='Number of Datasets')

fig.show()

In [None]:
#donut chart for the distribution of datasets across users' performance tiers
labels = user_tiers['Author_tier']
values = user_tiers['Id']

fig = go.Figure(data=[go.Pie(labels=labels, values=values, title = "TIERS", hole=.6)])
fig.update_layout(title_text='User performance tiers across all Kaggle datasets (percentage)', height = 650)
fig.show()

Most of the datasets are uploaded by novices. Now let's see who gets the most upvotes on average:

In [None]:
#bar plotting author tiers and the average upvotes
user_tiers_upvotes = datasets.groupby('Author_tier')['TotalVotes'].mean().sort_values(ascending = True).reset_index().round(1)

x = user_tiers_upvotes['Author_tier']
y = user_tiers_upvotes['TotalVotes']

fig = go.Figure(data=[go.Bar(x=x, y=y,text=y,textposition='auto')])
fig.update_layout(title_text='Dataset author tier and the average Upvotes', title_x=0.5)
fig.update_yaxes(title='Average Upvotes')

fig.show()

On the above graph it is exactly the opposite: the higher the author tier, the higher the average upvotes. That is expected, because more experienced authors likely create better datasets and have more followers. However, it is not absolutely clear how these two variables influence each other.
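When tier names are plain strings, grouped results come out in alphabetical or value order rather than tier order. A possible alternative is an ordered categorical, sketched below. The tier codes are taken from the mapping used in the code above; the rows are made-up toy data.

```python
import pandas as pd

# Tier codes as used in the mapping in the code above
tier_names = {0: "Novice", 1: "Contributor", 2: "Expert",
              3: "Master", 4: "Grandmaster", 5: "Kaggle Team"}

# Toy data (hypothetical tier codes and vote counts)
df = pd.DataFrame({
    "PerformanceTier": [0, 0, 1, 2, 3, 4, 4],
    "TotalVotes": [0, 2, 3, 6, 10, 30, 50],
})

# Ordered categorical: groupby output follows tier order, not alphabetical order
df["Author_tier"] = pd.Categorical(
    df["PerformanceTier"].map(tier_names),
    categories=list(tier_names.values()),
    ordered=True,
)

# observed=True drops tiers that do not occur in the data
mean_votes = df.groupby("Author_tier", observed=True)["TotalVotes"].mean()
print(mean_votes)
```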

<a id="section-seven"></a>
### 7. Hyped topics and the upvotes

By looking at the datasets search page, sometimes it seems that the most upvoted datasets on Kaggle are usually on hyped/hot topics (COVID19 etc). Let's look at the most upvoted datasets uploaded in the last 3 years and check if that is correct.

In [None]:
#data prep & merging
datasets.insert(2, 'Title', datasets['Id'].map(dataset_versions.drop_duplicates('DatasetId').set_index('DatasetId')['Title']))

#### Top 20 upvoted datasets uploaded in 2020:

In [None]:
# getting the top upvoted datasets in 2020
upvoted_2020 = datasets[['Title', 'TotalVotes', 'Year_uploaded']].sort_values(by='TotalVotes', ascending = False)
upvoted_2020 = upvoted_2020[upvoted_2020['Year_uploaded']==2020]
upvoted_2020 = upvoted_2020.drop(['Year_uploaded'], axis=1)
upvoted_2020.head(20).reset_index(drop=True)

#### Top 20 upvoted datasets uploaded in 2019:

In [None]:
# getting the top upvoted datasets in 2019
upvoted_2019 = datasets[['Title', 'TotalVotes', 'Year_uploaded']].sort_values(by='TotalVotes', ascending = False)
upvoted_2019 = upvoted_2019[upvoted_2019['Year_uploaded']==2019]
upvoted_2019 = upvoted_2019.drop(['Year_uploaded'], axis=1)
upvoted_2019.head(20).reset_index(drop=True)

#### Top 20 upvoted datasets uploaded in 2018:

In [None]:
# getting the top upvoted datasets in 2018
upvoted_2018 = datasets[['Title', 'TotalVotes', 'Year_uploaded']].sort_values(by='TotalVotes', ascending = False)
upvoted_2018 = upvoted_2018[upvoted_2018['Year_uploaded']==2018]
upvoted_2018 = upvoted_2018.drop(['Year_uploaded'], axis=1)
upvoted_2018.head(20).reset_index(drop=True)

In 2020, COVID-19 related datasets actually had the most upvotes, but for other years the dataset topics are well distributed across various themes. So probably in order to obtain the upvotes, the dataset can be of any theme as long as it is good in terms of quality and usability.
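The three cells above repeat the same steps with a different year. A small helper could do the same in one call; the function name `top_upvoted` and the toy table below are hypothetical and only meant as a sketch.

```python
import pandas as pd

def top_upvoted(df: pd.DataFrame, year: int, n: int = 20) -> pd.DataFrame:
    """Return the n most upvoted datasets uploaded in the given year."""
    in_year = df.loc[df["Year_uploaded"] == year, ["Title", "TotalVotes"]]
    return in_year.nlargest(n, "TotalVotes").reset_index(drop=True)

# Toy data (hypothetical titles and vote counts)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "TotalVotes": [5, 50, 20, 7],
    "Year_uploaded": [2019, 2020, 2020, 2020],
})

print(top_upvoted(toy, 2020, n=2))
```

With the notebook's dataframe this would be called as `top_upvoted(datasets, 2020)`, assuming the `Title` and `Year_uploaded` columns have been added as in the cells above.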

<a id="section-eight"></a>
### Conclusion

- There is a strong correlation between the number of upvotes and the number of kernels/downloads/views associated with that dataset. However, the way they affect each other isn't 100% clear
 
- The average number of upvotes for datasets with at least 1 tag is higher than the average number of upvotes for datasets without tags. Also the more tags - the higher the average upvotes number. 
 
- The average number of upvotes for datasets with a specified license type is higher than the average number of upvotes for datasets without the license. 

- The average number of upvotes for datasets with both description and subtitle is higher than the average number of upvotes for datasets without any of these. 

- There is no correlation between the number of upvotes and the number of dataset updates
 
- There is no direct correlation between the number of days the dataset stays on Kaggle and the upvotes. However, the average number of upvotes for older datasets is higher (most likely because there were too many datasets with low upvotes uploaded recently)
 
- The average number of upvotes for datasets uploaded by authors with higher performance tiers is higher than for datasets uploaded by authors with lower tiers

- Popular topics are quite well distributed across the most upvoted datasets in the last 3 years

It is really difficult to create a popular dataset that is noticed and upvoted by Kagglers. Nearly 60% of uploaded datasets aren't getting any upvotes and only 8% of datasets achieve more than 10 upvotes.

Views, kernels, and downloads are correlated with upvotes. So potentially it's a good idea to add a few kernels to your dataset and try to attract some views to it.

In terms of dataset usability, it is strongly advised to add tags, a license type, a description, and a subtitle for more upvotes.

Frequently updating your dataset isn't really helpful in terms of upvotes. However, if the dataset is of good quality, it might gain some additional upvotes with time.

It probably doesn't really matter whether the dataset topic is currently hyped or not. The quality and usability of a dataset are more important, and they tend to be more valued on Kaggle.