# Professionals Dataset
On the US map, clicking on Virginia brings up a view of the professionals dataset filtered on Virginia.

The view shows 10 professionals per page, and the navigation buttons reach 100 pages, so only the top 1,000 professionals can be browsed. The professionals are sorted by total spend from highest to lowest. Total spend ranges from \$1.3B at the top of the first page to \$590K at the bottom of the 100th page.

The page offers additional filters on the left. The professionals dataset can be further filtered by state, city, tags, total payments, worked in, and badges. The state filter lists each state along with the number of professionals located in that state. Summing these per-state counts gives roughly 100K entries (in the view of the professionals dataset filtered on Virginia).

The table below is a rough summary of the total spend data.

| Total Spend             | Number of Professionals |
|-------------------------|-------------------------|
|  $\geq \$1\mathrm{B}$   |            1            |
|  $\geq \$100\mathrm{M}$ |           12            |
|  $\geq \$10\mathrm{M}$  |          132            |
|  $\geq \$1\mathrm{M}$   |          577            |
|  $< \$1\mathrm{M}$      |         ~99K            |

# Example Dataset
The example dataset is based on the professionals dataset.

The example dataset includes 100K professionals. The table below shows the approach for defining the number of professionals at each level of total spend.

| Total Spend             | Number of Professionals  |
|-------------------------|--------------------------|
|  $\geq \$1\mathrm{B}$   |     random 3-9           |
|  $\geq \$100\mathrm{M}$ |     random 10-20         |
|  $\geq \$10\mathrm{M}$  |     random 100-300       |
|  $\geq \$1\mathrm{M}$   |     random 500-1K        |
|  $\geq \$100\mathrm{K}$ |     random 5K-10K        |
|  $\geq \$10\mathrm{K}$  |     random 20K-40K       |
|  $\geq \$1\mathrm{K}$   | # required to reach 100K |
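
Because the last row is a remainder rather than a random draw, its size is bounded by the other six rows. The sketch below works out those bounds from the table, assuming both endpoints of each listed range are inclusive; the names `ranges`, `remainder_min`, and `remainder_max` are illustrative and not part of the notebook.

```python
# Inclusive (low, high) ranges copied from the table above.
# Treating both endpoints as inclusive is an assumption.
ranges = [(3, 9), (10, 20), (100, 300), (500, 1_000),
          (5_000, 10_000), (20_000, 40_000)]
total = 100_000

# The remainder is largest when every draw hits its low end,
# and smallest when every draw hits its high end.
remainder_max = total - sum(low for low, _ in ranges)
remainder_min = total - sum(high for _, high in ranges)

print(remainder_min, remainder_max)
```

Under these assumptions the lowest level always holds between 48,671 and 74,387 professionals, so it is never negative and always the largest group.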

# Code

In [1]:
import numpy
import pandas

Set the random number seed for repeatability.

In [2]:
numpy.random.seed(1234)

The example dataset includes 100K professionals.

In [3]:
total_number_of_professionals = int(1e5)  # integer, since it is a count
print(total_number_of_professionals)

100000


The levels of total spend range from \$1B to \$1K, that is, from $\$1 \times 10^9$ to $\$1 \times 10^3$, so the order of magnitude runs from 9 down to 3.

In [4]:
orders_of_magnitude = numpy.arange(9,2,-1)
print(orders_of_magnitude)

[9 8 7 6 5 4 3]


Compute the total spend level for each order of magnitude.

In [5]:
total_spend = numpy.power(10, orders_of_magnitude)
print(total_spend)

[1000000000  100000000   10000000    1000000     100000      10000
       1000]


Draw the number of professionals at each level of total spend.

In [6]:
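# numpy.random.randint(low, high) draws from [low, high): high itself is never returned.
number_of_professionals = numpy.array([ numpy.random.randint(3, 10),
                                        numpy.random.randint(10, 21),
                                        numpy.random.randint(100, 300),
                                        numpy.random.randint(500, 1e3),
                                        numpy.random.randint(5e3, 10e3),
                                        numpy.random.randint(20e3, 40e3),
                                        0 ])  # placeholder for the lowest level
# The lowest level takes whatever is needed to reach the total.
number_of_professionals[-1] = total_number_of_professionals - number_of_professionals.sum()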
print(number_of_professionals)

[    6    16   153   704  5664 28471 64986]
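
The first two draws pass `10` and `21` as the upper bound so that 9 and 20 can occur, but the remaining draws pass the table's upper value directly, so 300, 1K, 10K, and 40K can never be drawn. If the table's ranges are meant to be inclusive, one possible alternative is NumPy's `Generator.integers` with `endpoint=True`. This is only a sketch: the seed and the names `rng`, `ranges`, and `counts` are assumptions, and a `Generator` does not reproduce the numbers printed above.

```python
import numpy

rng = numpy.random.default_rng(1234)  # assumed seed; output differs from the legacy seed above
total = 100_000

# Inclusive (low, high) ranges, read directly from the table.
ranges = [(3, 9), (10, 20), (100, 300), (500, 1_000),
          (5_000, 10_000), (20_000, 40_000)]

# endpoint=True makes the upper bound inclusive.
counts = [int(rng.integers(low, high, endpoint=True)) for low, high in ranges]
counts.append(total - sum(counts))  # lowest level takes the remainder
counts = numpy.array(counts)

print(counts)
```

Writing the ranges once, as data, keeps the code and the table from drifting apart.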


Put the data into a pandas DataFrame.

In [7]:
number_of_professionals_table = pandas.DataFrame()
number_of_professionals_table['Total Spend'] = total_spend
number_of_professionals_table['Number of Professionals'] = number_of_professionals
print(number_of_professionals_table)

   Total Spend  Number of Professionals
0   1000000000                        6
1    100000000                       16
2     10000000                      153
3      1000000                      704
4       100000                     5664
5        10000                    28471
6         1000                    64986


| Total Spend             | Number of Professionals  |
|-------------------------|--------------------------|
|  $\geq \$1\mathrm{B}$   |         6                |
|  $\geq \$100\mathrm{M}$ |        16                |
|  $\geq \$10\mathrm{M}$  |       153                |
|  $\geq \$1\mathrm{M}$   |       704                |
|  $\geq \$100\mathrm{K}$ |      5664                |
|  $\geq \$10\mathrm{K}$  |     28471                |
|  $\geq \$1\mathrm{K}$   |     64986                |

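For each of the 100K professionals, generate a random value for total spend. Within a level of order of magnitude $k$, values are integers drawn uniformly from $10^k$ up to, but not including, $10^{k+1}$, and are sorted from highest to lowest.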

In [8]:
example_datasets = []
for i, order_of_magnitude in enumerate(orders_of_magnitude):
    # Values at this level fall in [10**k, 10**(k+1)); randint excludes the upper bound.
    total_spend_min = numpy.power(10, order_of_magnitude)
    total_spend_max = numpy.power(10, order_of_magnitude+1)
    # Use a separate name so the number_of_professionals array is not overwritten.
    count = int(number_of_professionals_table.iloc[i]['Number of Professionals'])
    example_dataset_i = pandas.DataFrame(numpy.random.randint(low=total_spend_min,
                                                              high=total_spend_max,
                                                              size=count),
                                         columns=['Total Spend'])
    # Sort within the level; levels are already visited from highest to lowest.
    example_dataset_i = example_dataset_i.sort_values(by='Total Spend', ascending=False)
    example_datasets.append(example_dataset_i)
example_dataset = pandas.concat(example_datasets, ignore_index=True)
print(example_dataset)

       Total Spend
0       8614232354
1       6482372250
2       4762102042
3       4061031881
4       3935450949
5       3151757680
6        970717899
7        886580755
8        750164161
9        732230259
10       717540492
11       697971689
12       669689594
13       610345219
14       604415490
15       595622784
16       593431527
17       423759947
18       274927782
19       274470609
20       266344698
21       263843691
22        99754705
23        98542325
24        97255251
25        97237865
26        95305780
27        94924612
28        94245020
29        93896780
...            ...
99970         1003
99971         1003
99972         1003
99973         1002
99974         1002
99975         1002
99976         1002
99977         1002
99978         1002
99979         1002
99980         1002
99981         1001
99982         1001
99983         1001
99984         1001
99985         1001
99986         1001
99987         1001
99988         1001
99989         1001
99990       
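
One way to check a generated dataset against the intended counts is to recover each value's order of magnitude and tally the rows per level. The sketch below does this on a small stand-in dataset rather than the real one; the seed, the counts in `expected_counts`, and all variable names are hypothetical.

```python
import numpy
import pandas

rng = numpy.random.default_rng(0)  # assumed seed for the stand-in data

# Hypothetical counts per order of magnitude, far smaller than the real dataset.
expected_counts = {9: 2, 6: 5, 3: 8}
values = numpy.concatenate([rng.integers(10**k, 10**(k + 1), size=n)
                            for k, n in expected_counts.items()])
frame = pandas.DataFrame({'Total Spend': values})

# For a positive integer, the number of digits minus one is its order of magnitude.
frame['Order of Magnitude'] = frame['Total Spend'].astype(str).str.len() - 1

tally = frame['Order of Magnitude'].value_counts()
observed_counts = {int(k): int(v) for k, v in tally.items()}

print(observed_counts == expected_counts)
```

Counting digits avoids the floating-point rounding that `log10` could introduce for values just below a power of ten.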

Save the example dataset to a file.

In [9]:
example_dataset.to_csv('project-files/example_dataset.csv')
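
As called above, `to_csv` also writes the row index as an unnamed first column, and it does not create the `project-files` directory if it is missing. The sketch below shows a round trip that reads the index back with `index_col=0`. It uses a temporary directory and three values taken from the printed output above as a stand-in; the names `example`, `restored`, and `round_trip_ok` are illustrative.

```python
import os
import tempfile

import pandas

# Stand-in for the example dataset: three values from the printed output above.
example = pandas.DataFrame({'Total Spend': [8614232354, 970717899, 1001]})

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'example_dataset.csv')
    example.to_csv(path)                            # index becomes the first column
    restored = pandas.read_csv(path, index_col=0)   # read that column back as the index

round_trip_ok = restored['Total Spend'].tolist() == example['Total Spend'].tolist()
print(round_trip_ok)
```

Passing `index=False` to `to_csv` would drop the index column instead, at the cost of changing the file layout that the notebook currently produces.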